This patchset aims to address some shortcomings of the mb allocator
that we noticed while running various tests and workloads on a
POWERPC machine with a 64k block size.
** Problems **
More specifically, we were seeing a sharp drop in performance when the
FS was highly fragmented (64K bs). We noticed that:
Problem 1: The prefetch logic seemed to be skipping BLOCK_UNINIT groups,
resulting in the buddy and CR0/1 caches not being initialized for these
groups even though that could be done without any IO. (Not sure if there
was any history behind this design, do let me know if so).
Problem 2: With a 64K bs FS, we commonly saw cases where CR1
would correctly identify a good group but, due to very high
fragmentation, the complex scan would exit early because ac->ac_found >
s_mb_max_to_scan, resulting in trimming of the allocated length.
Problem 3: Even though our average free extent was, say, 4MB and the
original request was merely 1 block of data, mballoc normalization kept
adding PAs and requesting 8MB chunks. This led to almost all the
requests falling into the slower CR2, and with more threads we started
seeing lots of CR3 requests as well.
** How did we address them **
Problem 1 (Patch 8,9): Make ext4_mb_prefetch() also call
ext4_read_block_bitmap_nowait() in case of BLOCK_UNINIT, so it can init
the BG and exit early without any IO. Next, fix the calls to
prefetch_fini so these newly initialized BGs can have their buddies
initialised.
Problem 2 (Patch 7): When we come to the complex scan after CR1, my
understanding is that, since free/frag > goal_len, we can be sure that
there is at least one chunk big enough to accommodate the goal request.
Hence, we can skip the overhead of mb_find_extent() and other accounting
for each free extent and just process extents that are big enough.
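The skip logic can be sketched in userspace C (my own illustration for clarity; `find_big_extent` and the flat `bits` array are hypothetical stand-ins, not the kernel's bitmap API):

```c
#include <assert.h>

/*
 * Hypothetical userspace sketch of the idea (not the kernel code):
 * since free/frag > goal_len guarantees at least one big enough run,
 * skip any free run shorter than goal_len without measuring it.
 * bits[] holds 1 for a free block, 0 for a used block.
 */
static int find_big_extent(const int *bits, int nbits, int goal_len)
{
	int i = 0;

	while (i < nbits) {
		if (!bits[i]) {		/* used block: advance */
			i++;
			continue;
		}
		int j = i;		/* find the end of this free run */
		while (j < nbits && bits[j])
			j++;
		if (j - i >= goal_len)	/* big enough: this is where the */
			return i;	/* full mb_find_extent() work goes */
		i = j;			/* too small: skip the whole run */
	}
	return -1;			/* no run of goal_len blocks found */
}
```

With bits = {1,1,0,1,1,1,0} and a goal of 3, this finds the run starting at index 3 while never doing per-extent accounting for the 2-block run at index 0.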
Problem 3 (Patch 11): To solve this problem, this patchset implements a
new allocation criterion (CR1.5, or CR1_5 in code). The idea is that if
CR1 fails to find a BG, we jump to CR1.5. Here the flow is as follows:
* We assume that if CR1 has failed, none of the currently cached BGs
  have a big enough contiguous extent to satisfy our request. In this
  case we fall to CR1.5.
* In CR1.5, we find the highest available free/frag BGs (from the CR1
  lists) and trim the PAs to this order so that we can find a BG
  without the IO overhead of CR2.
* In parallel, prefetch keeps bringing more groups into memory, and as
  more and more groups are cached, CR1.5 becomes a better replacement
  for CR2. For example, if all BGs are cached and we couldn't find
  anything in CR0/1, we can assume that no BG has a big enough
  contiguous free extent, and hence CR1.5 can directly trim and find
  the next biggest extent we could get. In this scenario, without
  CR1.5, we would have continued scanning in CR2, which would most
  probably have trimmed the request after scanning ~200 extents.
CR1.5 results in improved allocation speed at the cost of slightly
increased trimming of the allocated length.
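The trimming step can be sketched as follows (a minimal userspace sketch under my assumptions; `cr1_5_trim_goal` and `best_avail_order` are hypothetical names, not the kernel's API):

```c
#include <assert.h>

/*
 * Hypothetical sketch of CR1.5's trimming idea (not the kernel code):
 * if no cached BG can serve the full goal, trim the goal down to the
 * biggest fragment order that some cached BG can actually serve.
 *
 * goal_len:         requested length in blocks (normalized request)
 * best_avail_order: highest order with a non-empty avg-fragment-size list
 */
static int cr1_5_trim_goal(int goal_len, int best_avail_order)
{
	int order = 0;

	/* order of the goal: largest order with 2^order <= goal_len */
	while ((1 << (order + 1)) <= goal_len)
		order++;

	if (order > best_avail_order)
		order = best_avail_order; /* trim to what is available */

	return 1 << order;		  /* trimmed goal length */
}
```

For a normalized 8MB goal (128 blocks at 64k bs) when the largest cached fragments are around 4MB (order 6, i.e. 64 blocks), the request is trimmed to 64 blocks instead of scanning on into CR2.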
** Performance Numbers **
Unless stated otherwise, these numbers are from fsmark and fio tests with 64k
BS, 64K pagesize on a 100Gi nvme0n1 with nodelalloc. These tests were performed
after the FS was fragmented until Avg Fragment Size == 4MB.
* Test 1: Writing ~40000 files of 64K each in a single directory (64 threads, fsmark)
* Test 2: Same as Test 1 on a 500GiB pmem device with dax
* Test 3: 5Gi write with mix of random and seq writes (fio)
* Test 4: 5Gi sequential writes (fio)
Here:
e = extents scanned
c = cr0 / cr1 / cr1.5 / cr2 / cr3 hits
+---------+-----------------------------------+------------------------------------+
|         | Unpatched                         | Patched                            |
+---------+-----------------------------------+------------------------------------+
| Test 1  | 6866 files/s                      | 13527 files/s                      |
|         | e: 8,188,644                      | e: 1,719,725                       |
|         | c: 381 / 330 / - / 4779 / 35534   | c: 381 / 280 / 33299 / 1000 / 6064 |
+---------+-----------------------------------+------------------------------------+
| Test 2  | 6927 files/s                      | 8422 files/s                       |
|         | e: 8,055,911                      | e: 261,268                         |
|         | c: 1011 / 999 / - / 6153 / 32861  | c: 1721 / 1210 / 38093 / 0 / 0     |
+---------+-----------------------------------+------------------------------------+
| Test 3  | 387 MiB/s                         | 443 MiB/s                          |
+---------+-----------------------------------+------------------------------------+
| Test 4  | 3139 MiB/s                        | 3180 MiB/s                         |
+---------+-----------------------------------+------------------------------------+
The numbers for the same tests with 4k bs, 64k pagesize are:
+---------+-----------------------------------+--------------------------------------+
|         | Unpatched                         | Patched                              |
+---------+-----------------------------------+--------------------------------------+
| Test 1  | 21618 files/s                     | 23528 files/s                        |
|         | e: 8,149,272                      | e: 223,013                           |
|         | c: 34 / 1380 / - / 5624 / 34710   | c: 34 / 1341 / 40387 / 0 / 0         |
+---------+-----------------------------------+--------------------------------------+
| Test 2  | 30739 files/s                     | 30946 files/s                        |
|         | e: 7,742,853                      | e: 2,176,475                         |
|         | c: 1131 / 2244 / - / 3914 / 34468 | c: 1596 / 1079 / 28425 / 1098 / 8547 |
+---------+-----------------------------------+--------------------------------------+
| Test 3  | 200 MiB/s                         | 186 MiB/s                            |
+---------+-----------------------------------+--------------------------------------+
| Test 4  | 621 MiB/s                         | 632 MiB/s                            |
+---------+-----------------------------------+--------------------------------------+
** Some Observations **
1. In the case of a highly fragmented 64k blocksize FS, most of the performance
is lost because we hold the BG lock while scanning a block group for the best
extent. As our goal length is 8MB and we only have 4MB extents, the scan takes
a long time, causing other threads to wait on the BG lock. This can be seen in
a perf diff of unpatched vs patched:
83.14% -24.89% [kernel.vmlinux] [k] do_raw_spin_lock
Using lockstat and the perf call graph, I was able to confirm that this was the
BG lock taken in ext4_mb_regular_allocator(), contending with other processes
trying to take the same BG's lock in ext4_mb_regular_allocator() and
__ext4_new_inode().
2. Currently, I do see some increase in fragmentation. Below are the
e2freefrag results after Test 1 with 64k BS:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Unpatched:
Min. free extent: 128 KB
Max. free extent: 8000 KB
Avg. free extent: 4096 KB
Num. free extent: 12630
HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
 128K...  256K-  :             1            2    0.00%
 256K...  512K-  :             1            6    0.00%
 512K... 1024K-  :             4           48    0.01%
   1M...    2M-  :             5          120    0.01%
   2M...    4M-  :         11947       725624   85.31%
   4M...    8M-  :           672        83796    9.85%
Patched:
Min. free extent: 64 KB
Max. free extent: 11648 KB
Avg. free extent: 2688 KB
Num. free extent: 18847
HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
  64K...  128K-  :             1            1    0.00%
 128K...  256K-  :             2            5    0.00%
 256K...  512K-  :             1            5    0.00%
 512K... 1024K-  :           297         3909    0.48%
   1M...    2M-  :         11221       341065   42.13%
   2M...    4M-  :          4940       294260   36.35%
   4M...    8M-  :          2384       170169   21.02%
   8M...   16M-  :             1          182    0.02%
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
3. I was hoping to get some feedback on enabling prefetch of BLOCK_UNINIT
BGs, and any history on why it was disabled.
-------------------------------------
These changes look good from my end, so I'm posting them for
feedback from the ext4 community.
(gce-xfstests -c all quick went fine with no new failures reported)
Any thoughts/suggestions are welcome!!
Regards,
Ojaswin
Ojaswin Mujoo (8):
ext4: Convert mballoc cr (criteria) to enum
ext4: Add per CR extent scanned counter
ext4: Add counter to track successful allocation of goal length
ext4: Avoid scanning smaller extents in BG during CR1
ext4: Don't skip prefetching BLOCK_UNINIT groups
ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
ext4: Abstract out logic to search average fragment list
ext4: Add allocation criteria 1.5 (CR1_5)
Ritesh Harjani (IBM) (3):
ext4: mballoc: Remove useless setting of ac_criteria
ext4: Remove unused extern variables declaration
ext4: mballoc: Fix getting the right group desc in
ext4_mb_prefetch_fini
fs/ext4/ext4.h | 23 +++-
fs/ext4/mballoc.c | 284 +++++++++++++++++++++++++++++++++-------------
fs/ext4/mballoc.h | 27 ++++-
fs/ext4/super.c | 11 +-
fs/ext4/sysfs.c | 2 +
5 files changed, 255 insertions(+), 92 deletions(-)
--
2.31.1
ext4_mb_stats & ext4_mb_max_to_scan are never used. We use
sbi->s_mb_stats and sbi->s_mb_max_to_scan instead.
Hence kill these extern declarations.
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
Signed-off-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/ext4.h | 2 --
fs/ext4/mballoc.h | 2 +-
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 140e1eb300d1..b8b00457da8d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2903,8 +2903,6 @@ int ext4_fc_record_regions(struct super_block *sb, int ino,
/* mballoc.c */
extern const struct seq_operations ext4_mb_seq_groups_ops;
extern const struct seq_operations ext4_mb_seq_structs_summary_ops;
-extern long ext4_mb_stats;
-extern long ext4_mb_max_to_scan;
extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
extern int ext4_mb_init(struct super_block *);
extern int ext4_mb_release(struct super_block *);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index dcda2a943cee..165a17893c81 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -49,7 +49,7 @@
#define MB_DEFAULT_MIN_TO_SCAN 10
/*
- * with 'ext4_mb_stats' allocator will collect stats that will be
+ * with 's_mb_stats' allocator will collect stats that will be
* shown at umount. The collecting costs though!
*/
#define MB_DEFAULT_STATS 0
--
2.31.1
The group descriptor and group info are not of the same group in
ext4_mb_prefetch_fini(). This problem was found during code
review/walkthrough and seems like a bug, so fix it.
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
Signed-off-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 572e79a698d4..8b22cc07b054 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2569,14 +2569,14 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
unsigned int nr)
{
- while (nr-- > 0) {
- struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group,
- NULL);
- struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+ struct ext4_group_desc *gdp;
+ struct ext4_group_info *grp;
+ while (nr-- > 0) {
if (!group)
group = ext4_get_groups_count(sb);
group--;
+ gdp = ext4_get_group_desc(sb, group, NULL);
grp = ext4_get_group_info(sb, group);
if (EXT4_MB_GRP_NEED_INIT(grp) &&
--
2.31.1
Convert the criteria to an enum so it is easier to maintain. This change
also makes it easier to insert new criteria in the future.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/ext4.h | 14 ++++++--
fs/ext4/mballoc.c | 88 +++++++++++++++++++++++------------------------
fs/ext4/mballoc.h | 10 ++++++
3 files changed, 65 insertions(+), 47 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b8b00457da8d..6037b8e0af86 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -126,6 +126,14 @@ enum SHIFT_DIRECTION {
SHIFT_RIGHT,
};
+/*
+ * Number of criteria defined. For each criterion, mballoc has a slightly
+ * different way of finding the required blocks, and usually, the higher the
+ * criterion, the slower the allocation. We start at lower criteria and keep
+ * falling back to higher ones if we are not able to find any blocks.
+ */
+#define EXT4_MB_NUM_CRS 4
+
/*
* Flags used in mballoc's allocation_context flags field.
*
@@ -1631,9 +1639,9 @@ struct ext4_sb_info {
atomic_t s_bal_2orders; /* 2^order hits */
atomic_t s_bal_cr0_bad_suggestions;
atomic_t s_bal_cr1_bad_suggestions;
- atomic64_t s_bal_cX_groups_considered[4];
- atomic64_t s_bal_cX_hits[4];
- atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
+ atomic64_t s_bal_cX_groups_considered[EXT4_MB_NUM_CRS];
+ atomic64_t s_bal_cX_hits[EXT4_MB_NUM_CRS];
+ atomic64_t s_bal_cX_failed[EXT4_MB_NUM_CRS]; /* cX loop didn't find blocks */
atomic_t s_mb_buddies_generated; /* number of buddies generated */
atomic64_t s_mb_generation_time;
atomic_t s_mb_lost_chunks;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8b22cc07b054..323604a2ff45 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -409,7 +409,7 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
- ext4_group_t group, int cr);
+ ext4_group_t group, enum criteria cr);
static int ext4_try_to_trim_range(struct super_block *sb,
struct ext4_buddy *e4b, ext4_grpblk_t start,
@@ -857,7 +857,7 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
* cr level needs an update.
*/
static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
- int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+ enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
{
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_group_info *iter, *grp;
@@ -882,8 +882,8 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
bb_largest_free_order_node) {
if (sbi->s_mb_stats)
- atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
- if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
+ atomic64_inc(&sbi->s_bal_cX_groups_considered[CR0]);
+ if (likely(ext4_mb_good_group(ac, iter->bb_group, CR0))) {
grp = iter;
break;
}
@@ -895,7 +895,7 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
if (!grp) {
/* Increment cr and search again */
- *new_cr = 1;
+ *new_cr = CR1;
} else {
*group = grp->bb_group;
ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
@@ -907,7 +907,7 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
* order. Updates *new_cr if cr level needs an update.
*/
static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
- int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+ enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
{
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_group_info *grp = NULL, *iter;
@@ -930,8 +930,8 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
bb_avg_fragment_size_node) {
if (sbi->s_mb_stats)
- atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
- if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
+ atomic64_inc(&sbi->s_bal_cX_groups_considered[CR1]);
+ if (likely(ext4_mb_good_group(ac, iter->bb_group, CR1))) {
grp = iter;
break;
}
@@ -945,7 +945,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
*group = grp->bb_group;
ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
} else {
- *new_cr = 2;
+ *new_cr = CR2;
}
}
@@ -953,7 +953,7 @@ static inline int should_optimize_scan(struct ext4_allocation_context *ac)
{
if (unlikely(!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN)))
return 0;
- if (ac->ac_criteria >= 2)
+ if (ac->ac_criteria >= CR2)
return 0;
if (!ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
return 0;
@@ -998,7 +998,7 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
* @ngroups Total number of groups
*/
static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
- int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+ enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
{
*new_cr = ac->ac_criteria;
@@ -1007,9 +1007,9 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
return;
}
- if (*new_cr == 0) {
+ if (*new_cr == CR0) {
ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
- } else if (*new_cr == 1) {
+ } else if (*new_cr == CR1) {
ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
} else {
/*
@@ -2378,13 +2378,13 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
* for the allocation or not.
*/
static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
- ext4_group_t group, int cr)
+ ext4_group_t group, enum criteria cr)
{
ext4_grpblk_t free, fragments;
int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
- BUG_ON(cr < 0 || cr >= 4);
+ BUG_ON(cr < CR0 || cr >= EXT4_MB_NUM_CRS);
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
return false;
@@ -2398,7 +2398,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
return false;
switch (cr) {
- case 0:
+ case CR0:
BUG_ON(ac->ac_2order == 0);
/* Avoid using the first bg of a flexgroup for data files */
@@ -2417,15 +2417,15 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
return false;
return true;
- case 1:
+ case CR1:
if ((free / fragments) >= ac->ac_g_ex.fe_len)
return true;
break;
- case 2:
+ case CR2:
if (free >= ac->ac_g_ex.fe_len)
return true;
break;
- case 3:
+ case CR3:
return true;
default:
BUG();
@@ -2446,7 +2446,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
* out"!
*/
static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
- ext4_group_t group, int cr)
+ ext4_group_t group, enum criteria cr)
{
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
struct super_block *sb = ac->ac_sb;
@@ -2464,7 +2464,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
free = grp->bb_free;
if (free == 0)
goto out;
- if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+ if (cr <= CR2 && free < ac->ac_g_ex.fe_len)
goto out;
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
goto out;
@@ -2479,7 +2479,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
ext4_get_group_desc(sb, group, NULL);
int ret;
- /* cr=0/1 is a very optimistic search to find large
+ /* cr=CR0/CR1 is a very optimistic search to find large
* good chunks almost for free. If buddy data is not
* ready, then this optimization makes no sense. But
* we never skip the first block group in a flex_bg,
@@ -2487,7 +2487,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
* and we want to make sure we locate metadata blocks
* in the first block group in the flex_bg if possible.
*/
- if (cr < 2 &&
+ if (cr < CR2 &&
(!sbi->s_log_groups_per_flex ||
((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) &&
!(ext4_has_group_desc_csum(sb) &&
@@ -2593,7 +2593,7 @@ static noinline_for_stack int
ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
{
ext4_group_t prefetch_grp = 0, ngroups, group, i;
- int cr = -1, new_cr;
+ enum criteria cr, new_cr;
int err = 0, first_err = 0;
unsigned int nr = 0, prefetch_ios = 0;
struct ext4_sb_info *sbi;
@@ -2651,13 +2651,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
}
/* Let's just scan groups to find more-less suitable blocks */
- cr = ac->ac_2order ? 0 : 1;
+ cr = ac->ac_2order ? CR0 : CR1;
/*
- * cr == 0 try to get exact allocation,
- * cr == 3 try to get anything
+ * cr == CR0 try to get exact allocation,
+ * cr == CR3 try to get anything
*/
repeat:
- for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
+ for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
ac->ac_criteria = cr;
/*
* searching for the right group start
@@ -2684,7 +2684,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
* spend a lot of time loading imperfect groups
*/
if ((prefetch_grp == group) &&
- (cr > 1 ||
+ (cr > CR1 ||
prefetch_ios < sbi->s_mb_prefetch_limit)) {
unsigned int curr_ios = prefetch_ios;
@@ -2726,9 +2726,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
}
ac->ac_groups_scanned++;
- if (cr == 0)
+ if (cr == CR0)
ext4_mb_simple_scan_group(ac, &e4b);
- else if (cr == 1 && sbi->s_stripe &&
+ else if (cr == CR1 && sbi->s_stripe &&
!(ac->ac_g_ex.fe_len % sbi->s_stripe))
ext4_mb_scan_aligned(ac, &e4b);
else
@@ -2768,7 +2768,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
ac->ac_b_ex.fe_len = 0;
ac->ac_status = AC_STATUS_CONTINUE;
ac->ac_flags |= EXT4_MB_HINT_FIRST;
- cr = 3;
+ cr = CR3;
goto repeat;
}
}
@@ -2891,36 +2891,36 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
seq_printf(seq, "\tgroups_scanned: %u\n", atomic_read(&sbi->s_bal_groups_scanned));
seq_puts(seq, "\tcr0_stats:\n");
- seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[0]));
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR0]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
- atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
+ atomic64_read(&sbi->s_bal_cX_groups_considered[CR0]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
- atomic64_read(&sbi->s_bal_cX_failed[0]));
+ atomic64_read(&sbi->s_bal_cX_failed[CR0]));
seq_printf(seq, "\t\tbad_suggestions: %u\n",
atomic_read(&sbi->s_bal_cr0_bad_suggestions));
seq_puts(seq, "\tcr1_stats:\n");
- seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
- atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
+ atomic64_read(&sbi->s_bal_cX_groups_considered[CR1]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
- atomic64_read(&sbi->s_bal_cX_failed[1]));
+ atomic64_read(&sbi->s_bal_cX_failed[CR1]));
seq_printf(seq, "\t\tbad_suggestions: %u\n",
atomic_read(&sbi->s_bal_cr1_bad_suggestions));
seq_puts(seq, "\tcr2_stats:\n");
- seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
- atomic64_read(&sbi->s_bal_cX_groups_considered[2]));
+ atomic64_read(&sbi->s_bal_cX_groups_considered[CR2]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
- atomic64_read(&sbi->s_bal_cX_failed[2]));
+ atomic64_read(&sbi->s_bal_cX_failed[CR2]));
seq_puts(seq, "\tcr3_stats:\n");
- seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[3]));
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR3]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
- atomic64_read(&sbi->s_bal_cX_groups_considered[3]));
+ atomic64_read(&sbi->s_bal_cX_groups_considered[CR3]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
- atomic64_read(&sbi->s_bal_cX_failed[3]));
+ atomic64_read(&sbi->s_bal_cX_failed[CR3]));
seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 165a17893c81..f0087a85e366 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -95,6 +95,16 @@
*/
#define MB_NUM_ORDERS(sb) ((sb)->s_blocksize_bits + 2)
+/*
+ * All possible allocation criteria for mballoc
+ */
+enum criteria {
+ CR0,
+ CR1,
+ CR2,
+ CR3,
+};
+
struct ext4_free_data {
/* this links the free block information from sb_info */
struct list_head efd_list;
--
2.31.1
This gives better visibility into the number of extents scanned in each
particular CR. For example, this information can be used to see how our
block group scanning logic performs when the BG is fragmented.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/mballoc.c | 12 ++++++++++++
fs/ext4/mballoc.h | 1 +
3 files changed, 14 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6037b8e0af86..4ba2c95915eb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1633,6 +1633,7 @@ struct ext4_sb_info {
atomic_t s_bal_success; /* we found long enough chunks */
atomic_t s_bal_allocated; /* in blocks */
atomic_t s_bal_ex_scanned; /* total extents scanned */
+ atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS]; /* total extents scanned */
atomic_t s_bal_groups_scanned; /* number of groups scanned */
atomic_t s_bal_goals; /* goal hits */
atomic_t s_bal_breaks; /* too long searches */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 323604a2ff45..07a50a13751c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2077,6 +2077,7 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);
ac->ac_found++;
+ ac->ac_cX_found[ac->ac_criteria]++;
/*
* The special case - take what you catch first
@@ -2249,6 +2250,7 @@ void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
break;
}
ac->ac_found++;
+ ac->ac_cX_found[ac->ac_criteria]++;
ac->ac_b_ex.fe_len = 1 << i;
ac->ac_b_ex.fe_start = k << i;
@@ -2362,6 +2364,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
max = mb_find_extent(e4b, i, sbi->s_stripe, &ex);
if (max >= sbi->s_stripe) {
ac->ac_found++;
+ ac->ac_cX_found[ac->ac_criteria]++;
ex.fe_logical = 0xDEADF00D; /* debug value */
ac->ac_b_ex = ex;
ext4_mb_use_best_found(ac, e4b);
@@ -2894,6 +2897,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR0]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
atomic64_read(&sbi->s_bal_cX_groups_considered[CR0]));
+ seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR0]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
atomic64_read(&sbi->s_bal_cX_failed[CR0]));
seq_printf(seq, "\t\tbad_suggestions: %u\n",
@@ -2903,6 +2907,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
atomic64_read(&sbi->s_bal_cX_groups_considered[CR1]));
+ seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR1]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
atomic64_read(&sbi->s_bal_cX_failed[CR1]));
seq_printf(seq, "\t\tbad_suggestions: %u\n",
@@ -2912,6 +2917,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
atomic64_read(&sbi->s_bal_cX_groups_considered[CR2]));
+ seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR2]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
atomic64_read(&sbi->s_bal_cX_failed[CR2]));
@@ -2919,6 +2925,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR3]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
atomic64_read(&sbi->s_bal_cX_groups_considered[CR3]));
+ seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR3]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
atomic64_read(&sbi->s_bal_cX_failed[CR3]));
seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
@@ -4216,7 +4223,12 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
atomic_inc(&sbi->s_bal_success);
+
atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
+ for (int i = 0; i < EXT4_MB_NUM_CRS; i++) {
+ atomic_add(ac->ac_cX_found[i], &sbi->s_bal_cX_ex_scanned[i]);
+ }
+
atomic_add(ac->ac_groups_scanned, &sbi->s_bal_groups_scanned);
if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index f0087a85e366..004b8d163cc9 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -193,6 +193,7 @@ struct ext4_allocation_context {
__u16 ac_groups_scanned;
__u16 ac_groups_linear_remaining;
__u16 ac_found;
+ __u16 ac_cX_found[EXT4_MB_NUM_CRS];
__u16 ac_tail;
__u16 ac_buddy;
__u8 ac_status;
--
2.31.1
Track the number of allocations where the length of blocks allocated
equals the length of the goal blocks (post normalization). This metric
could be useful when making changes to the allocator logic in the
future, as it gives us visibility into how often we trim our requests.
PS: ac_b_ex.fe_len might get modified due to preallocation efforts, and
hence we use ac_f_ex.fe_len instead, since we want to compare how much
the allocator was actually able to find.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/mballoc.c | 3 +++
2 files changed, 4 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4ba2c95915eb..d8fa01e54e81 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1636,6 +1636,7 @@ struct ext4_sb_info {
atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS]; /* total extents scanned */
atomic_t s_bal_groups_scanned; /* number of groups scanned */
atomic_t s_bal_goals; /* goal hits */
+ atomic_t s_bal_len_goals; /* len goal hits */
atomic_t s_bal_breaks; /* too long searches */
atomic_t s_bal_2orders; /* 2^order hits */
atomic_t s_bal_cr0_bad_suggestions;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 07a50a13751c..c4ab8f412d32 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2930,6 +2930,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
atomic64_read(&sbi->s_bal_cX_failed[CR3]));
seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
+ seq_printf(seq, "\t\tlen_goal_hits: %u\n", atomic_read(&sbi->s_bal_len_goals));
seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
seq_printf(seq, "\t\tbreaks: %u\n", atomic_read(&sbi->s_bal_breaks));
seq_printf(seq, "\t\tlost: %u\n", atomic_read(&sbi->s_mb_lost_chunks));
@@ -4233,6 +4234,8 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
atomic_inc(&sbi->s_bal_goals);
+ if (ac->ac_f_ex.fe_len == ac->ac_g_ex.fe_len)
+ atomic_inc(&sbi->s_bal_len_goals);
if (ac->ac_found > sbi->s_mb_max_to_scan)
atomic_inc(&sbi->s_bal_breaks);
}
--
2.31.1
When we are inside ext4_mb_complex_scan_group() in CR1, we can be sure
that this group has at least one big enough contiguous free extent to
satisfy our request because (free / fragments) > goal length.
Hence, instead of wasting time looping over smaller free extents, only
consider a free extent if we are sure that it has enough contiguous
free space to satisfy the goal length. This is particularly useful when
scanning highly fragmented BGs in CR1 as, without this patch, the
allocator might stop scanning early before reaching the big enough free
extent (due to ac_found > mb_max_to_scan), which causes us to
unnecessarily trim the request.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c4ab8f412d32..14529d2fe65f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2279,7 +2279,7 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
struct super_block *sb = ac->ac_sb;
void *bitmap = e4b->bd_bitmap;
struct ext4_free_extent ex;
- int i;
+ int i, j, freelen;
int free;
free = e4b->bd_info->bb_free;
@@ -2306,6 +2306,23 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
break;
}
+ if (ac->ac_criteria < CR2) {
+ /*
+ * In CR1, we are sure that this group will
+ * have a large enough continuous free extent, so skip
+ * over the smaller free extents
+ */
+ j = mb_find_next_bit(bitmap,
+ EXT4_CLUSTERS_PER_GROUP(sb), i);
+ freelen = j - i;
+
+ if (freelen < ac->ac_g_ex.fe_len) {
+ i = j;
+ free -= freelen;
+ continue;
+ }
+ }
+
mb_find_extent(e4b, i, ac->ac_g_ex.fe_len, &ex);
if (WARN_ON(ex.fe_len <= 0))
break;
--
2.31.1
Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
As a consequence, we end up not initializing the buddy structures and
CR0/1 lists for these BGs, even though that can be done without any
disk IO overhead. Hence, don't skip such BGs during prefetch and
prefetch_fini.
This improves the accuracy of CR0/1 allocation: earlier, essentially
empty BLOCK_UNINIT groups could be ignored by CR0/1 because their buddy
was not initialized, leading to slower CR2 allocations. With this patch
CR0/1 will be able to discover these groups as well, thus improving
performance.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 14529d2fe65f..48726a831264 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2557,9 +2557,7 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
*/
if (!EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
EXT4_MB_GRP_NEED_INIT(grp) &&
- ext4_free_group_clusters(sb, gdp) > 0 &&
- !(ext4_has_group_desc_csum(sb) &&
- (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
+ ext4_free_group_clusters(sb, gdp) > 0) {
bh = ext4_read_block_bitmap_nowait(sb, group, true);
if (bh && !IS_ERR(bh)) {
if (!buffer_uptodate(bh) && cnt)
@@ -2600,9 +2598,7 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
grp = ext4_get_group_info(sb, group);
if (EXT4_MB_GRP_NEED_INIT(grp) &&
- ext4_free_group_clusters(sb, gdp) > 0 &&
- !(ext4_has_group_desc_csum(sb) &&
- (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
+ ext4_free_group_clusters(sb, gdp) > 0) {
if (ext4_mb_init_group(sb, group, GFP_NOFS))
break;
}
--
2.31.1
Before this patch, the call stack in ext4_run_li_request is as follows:
/*
* nr = no. of BGs we want to fetch (=s_mb_prefetch)
* prefetch_ios = no. of BGs not uptodate after
* ext4_read_block_bitmap_nowait()
*/
next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
ext4_mb_prefetch_fini(sb, next_group, prefetch_ios);
ext4_mb_prefetch_fini() will then only try to initialize buddies for BGs
in the range [next_group - prefetch_ios, next_group). This is incorrect
since sometimes prefetch_ios < nr, which causes ext4_mb_prefetch_fini()
to incorrectly skip some of the BGs that might need initialization. This
issue is more notable now that the previous patch enables "fetching" of
BLOCK_UNINIT BGs, which are marked buffer_uptodate by default.
Fix this by passing nr to ext4_mb_prefetch_fini() instead of
prefetch_ios so that it considers the right range of groups.
Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in
ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT
groups that would need buddy initialization.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 4 ----
fs/ext4/super.c | 11 ++++-------
2 files changed, 4 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 48726a831264..410c9636907b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2702,8 +2702,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
if ((prefetch_grp == group) &&
(cr > CR1 ||
prefetch_ios < sbi->s_mb_prefetch_limit)) {
- unsigned int curr_ios = prefetch_ios;
-
nr = sbi->s_mb_prefetch;
if (ext4_has_feature_flex_bg(sb)) {
nr = 1 << sbi->s_log_groups_per_flex;
@@ -2712,8 +2710,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
}
prefetch_grp = ext4_mb_prefetch(sb, group,
nr, &prefetch_ios);
- if (prefetch_ios == curr_ios)
- nr = 0;
}
/* This now checks without needing the buddy page */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 72ead3b56706..9dbb09cfc8f7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3636,16 +3636,13 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
ext4_group_t group = elr->lr_next_group;
unsigned int prefetch_ios = 0;
int ret = 0;
+ int nr = EXT4_SB(sb)->s_mb_prefetch;
u64 start_time;
if (elr->lr_mode == EXT4_LI_MODE_PREFETCH_BBITMAP) {
- elr->lr_next_group = ext4_mb_prefetch(sb, group,
- EXT4_SB(sb)->s_mb_prefetch, &prefetch_ios);
- if (prefetch_ios)
- ext4_mb_prefetch_fini(sb, elr->lr_next_group,
- prefetch_ios);
- trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group,
- prefetch_ios);
+ elr->lr_next_group = ext4_mb_prefetch(sb, group, nr, &prefetch_ios);
+ ext4_mb_prefetch_fini(sb, elr->lr_next_group, nr);
+ trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group, nr);
if (group >= elr->lr_next_group) {
ret = 1;
if (elr->lr_first_not_zeroed != ngroups &&
--
2.31.1
Make the logic of searching the average fragment list of a given order
reusable by abstracting it out into a separate function. This also
avoids code duplication in upcoming patches.
No functional changes.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 51 ++++++++++++++++++++++++++++++-----------------
1 file changed, 33 insertions(+), 18 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 410c9636907b..1ce1174aea52 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -902,6 +902,37 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
}
}
+/*
+ * Find a suitable group of given order from the average fragments list.
+ */
+static struct ext4_group_info *
+ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int order)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct list_head *frag_list = &sbi->s_mb_avg_fragment_size[order];
+ rwlock_t *frag_list_lock = &sbi->s_mb_avg_fragment_size_locks[order];
+ struct ext4_group_info *grp = NULL, *iter;
+ enum criteria cr = ac->ac_criteria;
+
+ if (list_empty(frag_list))
+ return NULL;
+ read_lock(frag_list_lock);
+ if (list_empty(frag_list)) {
+ read_unlock(frag_list_lock);
+ return NULL;
+ }
+ list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
+ if (sbi->s_mb_stats)
+ atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
+ if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
+ grp = iter;
+ break;
+ }
+ }
+ read_unlock(frag_list_lock);
+ return grp;
+}
+
/*
* Choose next group by traversing average fragment size list of suitable
* order. Updates *new_cr if cr level needs an update.
@@ -910,7 +941,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
{
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
- struct ext4_group_info *grp = NULL, *iter;
+ struct ext4_group_info *grp = NULL;
int i;
if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
@@ -920,23 +951,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
i < MB_NUM_ORDERS(ac->ac_sb); i++) {
- if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
- continue;
- read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
- if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
- read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
- continue;
- }
- list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
- bb_avg_fragment_size_node) {
- if (sbi->s_mb_stats)
- atomic64_inc(&sbi->s_bal_cX_groups_considered[CR1]);
- if (likely(ext4_mb_good_group(ac, iter->bb_group, CR1))) {
- grp = iter;
- break;
- }
- }
- read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+ grp = ext4_mb_find_good_group_avg_frag_lists(ac, i);
if (grp)
break;
}
--
2.31.1
CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
fact that we couldn't find a group in CR1 suggests that it would be
difficult to find a continuous extent that completely satisfies our
allocation. So before falling back to the slower CR2, in CR1.5 we
proactively trim the preallocations so we can find a group with
(free / fragments) big enough. This speeds up the allocation at the
cost of slightly reduced preallocation.
The patch also adds a new sysfs tunable:
* /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order
This controls how much CR1.5 can trim a request before falling back to
CR2. For example, for a request of order 7 and a max trim order of 2,
CR1.5 can trim it down to order 5.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/ext4.h | 7 +++-
fs/ext4/mballoc.c | 97 ++++++++++++++++++++++++++++++++++++++++++++---
fs/ext4/mballoc.h | 14 +++++++
fs/ext4/sysfs.c | 2 +
4 files changed, 113 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d8fa01e54e81..879aac5e39a9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -132,7 +132,7 @@ enum SHIFT_DIRECTION {
* criteria the slower the allocation. We start at lower criterias and keep
* falling back to higher ones if we are not able to find any blocks.
*/
-#define EXT4_MB_NUM_CRS 4
+#define EXT4_MB_NUM_CRS 5
/*
* Flags used in mballoc's allocation_context flags field.
@@ -175,6 +175,9 @@ enum SHIFT_DIRECTION {
#define EXT4_MB_CR0_OPTIMIZED 0x8000
/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
#define EXT4_MB_CR1_OPTIMIZED 0x00010000
+/* Avg fragment size rb tree lookup succeeded at least once for cr = 1.5 */
+#define EXT4_MB_CR1_5_OPTIMIZED 0x00020000
+
struct ext4_allocation_request {
/* target inode for block we're allocating */
struct inode *inode;
@@ -1627,6 +1630,7 @@ struct ext4_sb_info {
unsigned long s_mb_last_start;
unsigned int s_mb_prefetch;
unsigned int s_mb_prefetch_limit;
+ unsigned int s_mb_cr1_5_max_trim_order;
/* stats for buddy allocator */
atomic_t s_bal_reqs; /* number of reqs with len > 1 */
@@ -1641,6 +1645,7 @@ struct ext4_sb_info {
atomic_t s_bal_2orders; /* 2^order hits */
atomic_t s_bal_cr0_bad_suggestions;
atomic_t s_bal_cr1_bad_suggestions;
+ atomic_t s_bal_cr1_5_bad_suggestions;
atomic64_t s_bal_cX_groups_considered[EXT4_MB_NUM_CRS];
atomic64_t s_bal_cX_hits[EXT4_MB_NUM_CRS];
atomic64_t s_bal_cX_failed[EXT4_MB_NUM_CRS]; /* cX loop didn't find blocks */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1ce1174aea52..8e9032f94966 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -960,6 +960,67 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
*group = grp->bb_group;
ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
} else {
+ *new_cr = CR1_5;
+ }
+}
+
+/*
+ * We couldn't find a group in CR1 so try to find the highest free fragment
+ * order we have and proactively trim the goal request length to that order to
+ * find a suitable group faster.
+ *
+ * This optimizes allocation speed at the cost of slightly reduced
+ * preallocations. However, we make sure that we don't trim the request too
+ * much and fall to CR2 in that case.
+ */
+static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
+ enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_group_info *grp = NULL;
+ int i, order, min_order;
+
+ if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
+ if (sbi->s_mb_stats)
+ atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
+ }
+
+ /*
+ * mb_avg_fragment_size_order() returns order in a way that makes
+ * retrieving back the length using (1 << order) inaccurate. Hence, use
+ * fls() instead since we need to know the actual length while modifying
+ * goal length.
+ */
+ order = fls(ac->ac_g_ex.fe_len);
+ min_order = order - sbi->s_mb_cr1_5_max_trim_order;
+ if (min_order < 0)
+ min_order = 0;
+
+ for (i = order; i >= min_order; i--) {
+ if (ac->ac_o_ex.fe_len <= (1 << i)) {
+ /*
+ * Scale down goal len to make sure we find something
+ * in the free fragments list. Basically, reduce
+ * preallocations.
+ */
+ ac->ac_g_ex.fe_len = 1 << i;
+ } else {
+ break;
+ }
+
+ grp = ext4_mb_find_good_group_avg_frag_lists(ac,
+ mb_avg_fragment_size_order(ac->ac_sb,
+ ac->ac_g_ex.fe_len));
+ if (grp)
+ break;
+ }
+
+ if (grp) {
+ *group = grp->bb_group;
+ ac->ac_flags |= EXT4_MB_CR1_5_OPTIMIZED;
+ } else {
+ /* Reset goal length to original goal length before falling into CR2 */
+ ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
*new_cr = CR2;
}
}
@@ -1026,6 +1087,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
} else if (*new_cr == CR1) {
ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
+ } else if (*new_cr == CR1_5) {
+ ext4_mb_choose_next_group_cr1_5(ac, new_cr, group, ngroups);
} else {
/*
* TODO: For CR=2, we can arrange groups in an rb tree sorted by
@@ -2323,7 +2386,7 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
if (ac->ac_criteria < CR2) {
/*
- * In CR1, we are sure that this group will
+ * In CR1 and CR1_5, we are sure that this group will
* have a large enough continuous free extent, so skip
* over the smaller free extents
*/
@@ -2453,6 +2516,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
return true;
case CR1:
+ case CR1_5:
if ((free / fragments) >= ac->ac_g_ex.fe_len)
return true;
break;
@@ -2715,7 +2779,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
* spend a lot of time loading imperfect groups
*/
if ((prefetch_grp == group) &&
- (cr > CR1 ||
+ (cr > CR1_5 ||
prefetch_ios < sbi->s_mb_prefetch_limit)) {
nr = sbi->s_mb_prefetch;
if (ext4_has_feature_flex_bg(sb)) {
@@ -2755,8 +2819,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
ac->ac_groups_scanned++;
if (cr == CR0)
ext4_mb_simple_scan_group(ac, &e4b);
- else if (cr == CR1 && sbi->s_stripe &&
- !(ac->ac_g_ex.fe_len % sbi->s_stripe))
+ else if ((cr == CR1 || cr == CR1_5) && sbi->s_stripe &&
+ !(ac->ac_g_ex.fe_len % sbi->s_stripe))
ext4_mb_scan_aligned(ac, &e4b);
else
ext4_mb_complex_scan_group(ac, &e4b);
@@ -2770,6 +2834,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
/* Processed all groups and haven't found blocks */
if (sbi->s_mb_stats && i == ngroups)
atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+
+ if (i == ngroups && ac->ac_criteria == CR1_5)
+ /* Reset goal length to original goal length before
+ * falling into CR2 */
+ ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
}
if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -2937,6 +3006,16 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
seq_printf(seq, "\t\tbad_suggestions: %u\n",
atomic_read(&sbi->s_bal_cr1_bad_suggestions));
+ seq_puts(seq, "\tcr1.5_stats:\n");
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1_5]));
+ seq_printf(seq, "\t\tgroups_considered: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_groups_considered[CR1_5]));
+ seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR1_5]));
+ seq_printf(seq, "\t\tuseless_loops: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_failed[CR1_5]));
+ seq_printf(seq, "\t\tbad_suggestions: %u\n",
+ atomic_read(&sbi->s_bal_cr1_5_bad_suggestions));
+
seq_puts(seq, "\tcr2_stats:\n");
seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
seq_printf(seq, "\t\tgroups_considered: %llu\n",
@@ -3452,6 +3531,8 @@ int ext4_mb_init(struct super_block *sb)
sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
sbi->s_mb_max_inode_prealloc = MB_DEFAULT_MAX_INODE_PREALLOC;
+ sbi->s_mb_cr1_5_max_trim_order = MB_DEFAULT_CR1_5_TRIM_ORDER;
+
/*
* The default group preallocation is 512, which for 4k block
* sizes translates to 2 megabytes. However for bigalloc file
@@ -4218,6 +4299,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
* placement or satisfy big request as is */
ac->ac_g_ex.fe_logical = start;
ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);
+ ac->ac_orig_goal_len = ac->ac_g_ex.fe_len;
/* define goal start in order to merge */
if (ar->pright && (ar->lright == (start + size))) {
@@ -4258,8 +4340,10 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
atomic_inc(&sbi->s_bal_goals);
- if (ac->ac_f_ex.fe_len == ac->ac_g_ex.fe_len)
+ /* did we allocate as much as normalizer originally wanted? */
+ if (ac->ac_f_ex.fe_len == ac->ac_orig_goal_len)
atomic_inc(&sbi->s_bal_len_goals);
+
if (ac->ac_found > sbi->s_mb_max_to_scan)
atomic_inc(&sbi->s_bal_breaks);
}
@@ -4652,7 +4736,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa = ac->ac_pa;
- if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
+ if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
int winl;
int wins;
int win;
@@ -5281,6 +5365,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
ac->ac_o_ex.fe_start = block;
ac->ac_o_ex.fe_len = len;
ac->ac_g_ex = ac->ac_o_ex;
+ ac->ac_orig_goal_len = ac->ac_g_ex.fe_len;
ac->ac_flags = ar->flags;
/* we have to define context: we'll work with a file or
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 004b8d163cc9..c1b0bf2f6f4d 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -90,6 +90,13 @@
*/
#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
+/*
+ * The maximum order up to which CR1.5 can trim a particular allocation
+ * request. For example, with an order 7 request and a max trim order of 3,
+ * CR1.5 can trim it down to order 4.
+ */
+#define MB_DEFAULT_CR1_5_TRIM_ORDER 3
+
/*
* Number of valid buddy orders
*/
@@ -101,6 +108,7 @@
enum criteria {
CR0,
CR1,
+ CR1_5,
CR2,
CR3,
};
@@ -188,6 +196,12 @@ struct ext4_allocation_context {
/* copy of the best found extent taken before preallocation efforts */
struct ext4_free_extent ac_f_ex;
+ /*
+ * goal len can change in CR1.5, so save the original len. This is
+ * used while adjusting the PA window and for accounting.
+ */
+ ext4_grpblk_t ac_orig_goal_len;
+
__u32 ac_groups_considered;
__u32 ac_flags; /* allocation hints */
__u16 ac_groups_scanned;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index d233c24ea342..5ba884a0246e 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -224,6 +224,7 @@ EXT4_RW_ATTR_SBI_UI(warning_ratelimit_interval_ms, s_warning_ratelimit_state.int
EXT4_RW_ATTR_SBI_UI(warning_ratelimit_burst, s_warning_ratelimit_state.burst);
EXT4_RW_ATTR_SBI_UI(msg_ratelimit_interval_ms, s_msg_ratelimit_state.interval);
EXT4_RW_ATTR_SBI_UI(msg_ratelimit_burst, s_msg_ratelimit_state.burst);
+EXT4_RW_ATTR_SBI_UI(mb_cr1_5_max_trim_order, s_mb_cr1_5_max_trim_order);
#ifdef CONFIG_EXT4_DEBUG
EXT4_RW_ATTR_SBI_UL(simulate_fail, s_simulate_fail);
#endif
@@ -275,6 +276,7 @@ static struct attribute *ext4_attrs[] = {
ATTR_LIST(warning_ratelimit_burst),
ATTR_LIST(msg_ratelimit_interval_ms),
ATTR_LIST(msg_ratelimit_burst),
+ ATTR_LIST(mb_cr1_5_max_trim_order),
ATTR_LIST(errors_count),
ATTR_LIST(warning_count),
ATTR_LIST(msg_count),
--
2.31.1
On Fri 27-01-23 18:07:29, Ojaswin Mujoo wrote:
> ext4_mb_stats & ext4_mb_max_to_scan are never used. We use
> sbi->s_mb_stats and sbi->s_mb_max_to_scan instead.
> Hence kill these extern declarations.
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
Nice. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/ext4.h | 2 --
> fs/ext4/mballoc.h | 2 +-
> 2 files changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 140e1eb300d1..b8b00457da8d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2903,8 +2903,6 @@ int ext4_fc_record_regions(struct super_block *sb, int ino,
> /* mballoc.c */
> extern const struct seq_operations ext4_mb_seq_groups_ops;
> extern const struct seq_operations ext4_mb_seq_structs_summary_ops;
> -extern long ext4_mb_stats;
> -extern long ext4_mb_max_to_scan;
> extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
> extern int ext4_mb_init(struct super_block *);
> extern int ext4_mb_release(struct super_block *);
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index dcda2a943cee..165a17893c81 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -49,7 +49,7 @@
> #define MB_DEFAULT_MIN_TO_SCAN 10
>
> /*
> - * with 'ext4_mb_stats' allocator will collect stats that will be
> + * with 's_mb_stats' allocator will collect stats that will be
> * shown at umount. The collecting costs though!
> */
> #define MB_DEFAULT_STATS 0
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 27-01-23 18:07:30, Ojaswin Mujoo wrote:
> group descriptor and group info are not of the same group in
> ext4_mb_prefetch_fini(). This problem was found during code
> review/walkthrough and seems like a bug, so fix it.
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/mballoc.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 572e79a698d4..8b22cc07b054 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2569,14 +2569,14 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
> void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
> unsigned int nr)
> {
> - while (nr-- > 0) {
> - struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group,
> - NULL);
> - struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> + struct ext4_group_desc *gdp;
> + struct ext4_group_info *grp;
>
> + while (nr-- > 0) {
> if (!group)
> group = ext4_get_groups_count(sb);
> group--;
> + gdp = ext4_get_group_desc(sb, group, NULL);
> grp = ext4_get_group_info(sb, group);
>
> if (EXT4_MB_GRP_NEED_INIT(grp) &&
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 27-01-23 18:07:31, Ojaswin Mujoo wrote:
> Convert criteria to be an enum so it easier to maintain. This change
> also makes it easier to insert new criterias in the future.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Just two small comments below:
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b8b00457da8d..6037b8e0af86 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -126,6 +126,14 @@ enum SHIFT_DIRECTION {
> SHIFT_RIGHT,
> };
>
> +/*
> + * Number of criterias defined. For each criteria, mballoc has slightly
> + * different way of finding the required blocks nad usually, higher the
^^^ and
> + * criteria the slower the allocation. We start at lower criterias and keep
> + * falling back to higher ones if we are not able to find any blocks.
> + */
> +#define EXT4_MB_NUM_CRS 4
> +
So defining this in a different header than the enum itself is fragile. I
understand you need it in ext4_sb_info declaration so probably I'd move the
enum declaration to ext4.h. Alternatively I suppose we could move a lot of
mballoc stuff out of ext4_sb_info into a separate struct because there's a
lot of it. But that would be much larger undertaking.
Also when going for symbolic allocator scan names maybe we could actually
make names sensible instead of CR[0-4]? Perhaps like CR_ORDER2_ALIGNED,
CR_BEST_LENGTH_FAST, CR_BEST_LENGTH_ALL, CR_ANY_FREE. And probably we could
deal with ordered comparisons like in:
if (cr < 2 &&
(!sbi->s_log_groups_per_flex ||
((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) &&
!(ext4_has_group_desc_csum(sb) &&
(gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))))
return 0;
to declare CR_FAST_SCAN = 2, or something like that. What do you think?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 27-01-23 18:07:32, Ojaswin Mujoo wrote:
> This gives better visibility into the number of extents scanned in each
> particular CR. For example, this information can be used to see how our
> block group scanning logic is performing when the BG is fragmented.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/mballoc.c | 12 ++++++++++++
> fs/ext4/mballoc.h | 1 +
> 3 files changed, 14 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6037b8e0af86..4ba2c95915eb 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1633,6 +1633,7 @@ struct ext4_sb_info {
> atomic_t s_bal_success; /* we found long enough chunks */
> atomic_t s_bal_allocated; /* in blocks */
> atomic_t s_bal_ex_scanned; /* total extents scanned */
> + atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS]; /* total extents scanned */
> atomic_t s_bal_groups_scanned; /* number of groups scanned */
> atomic_t s_bal_goals; /* goal hits */
> atomic_t s_bal_breaks; /* too long searches */
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 323604a2ff45..07a50a13751c 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2077,6 +2077,7 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
> BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);
>
> ac->ac_found++;
> + ac->ac_cX_found[ac->ac_criteria]++;
>
> /*
> * The special case - take what you catch first
> @@ -2249,6 +2250,7 @@ void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
> break;
> }
> ac->ac_found++;
> + ac->ac_cX_found[ac->ac_criteria]++;
>
> ac->ac_b_ex.fe_len = 1 << i;
> ac->ac_b_ex.fe_start = k << i;
> @@ -2362,6 +2364,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
> max = mb_find_extent(e4b, i, sbi->s_stripe, &ex);
> if (max >= sbi->s_stripe) {
> ac->ac_found++;
> + ac->ac_cX_found[ac->ac_criteria]++;
> ex.fe_logical = 0xDEADF00D; /* debug value */
> ac->ac_b_ex = ex;
> ext4_mb_use_best_found(ac, e4b);
> @@ -2894,6 +2897,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR0]));
> seq_printf(seq, "\t\tgroups_considered: %llu\n",
> atomic64_read(&sbi->s_bal_cX_groups_considered[CR0]));
> + seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR0]));
> seq_printf(seq, "\t\tuseless_loops: %llu\n",
> atomic64_read(&sbi->s_bal_cX_failed[CR0]));
> seq_printf(seq, "\t\tbad_suggestions: %u\n",
> @@ -2903,6 +2907,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1]));
> seq_printf(seq, "\t\tgroups_considered: %llu\n",
> atomic64_read(&sbi->s_bal_cX_groups_considered[CR1]));
> + seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR1]));
> seq_printf(seq, "\t\tuseless_loops: %llu\n",
> atomic64_read(&sbi->s_bal_cX_failed[CR1]));
> seq_printf(seq, "\t\tbad_suggestions: %u\n",
> @@ -2912,6 +2917,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
> seq_printf(seq, "\t\tgroups_considered: %llu\n",
> atomic64_read(&sbi->s_bal_cX_groups_considered[CR2]));
> + seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR2]));
> seq_printf(seq, "\t\tuseless_loops: %llu\n",
> atomic64_read(&sbi->s_bal_cX_failed[CR2]));
>
> @@ -2919,6 +2925,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR3]));
> seq_printf(seq, "\t\tgroups_considered: %llu\n",
> atomic64_read(&sbi->s_bal_cX_groups_considered[CR3]));
> + seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR3]));
> seq_printf(seq, "\t\tuseless_loops: %llu\n",
> atomic64_read(&sbi->s_bal_cX_failed[CR3]));
> seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
> @@ -4216,7 +4223,12 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
> atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
> if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
> atomic_inc(&sbi->s_bal_success);
> +
> atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
> + for (int i=0; i<EXT4_MB_NUM_CRS; i++) {
> + atomic_add(ac->ac_cX_found[i], &sbi->s_bal_cX_ex_scanned[i]);
> + }
> +
> atomic_add(ac->ac_groups_scanned, &sbi->s_bal_groups_scanned);
> if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
> ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index f0087a85e366..004b8d163cc9 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -193,6 +193,7 @@ struct ext4_allocation_context {
> __u16 ac_groups_scanned;
> __u16 ac_groups_linear_remaining;
> __u16 ac_found;
> + __u16 ac_cX_found[EXT4_MB_NUM_CRS];
> __u16 ac_tail;
> __u16 ac_buddy;
> __u8 ac_status;
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 27-01-23 18:07:33, Ojaswin Mujoo wrote:
> Track number of allocations where the length of blocks allocated is equal to the
> length of goal blocks (post normalization). This metric could be useful if
> making changes to the allocator logic in the future as it could give us
> visibility into how often do we trim our requests.
>
> PS: ac_b_ex.fe_len might get modified due to preallocation efforts and
> hence we use ac_f_ex.fe_len instead since we want to compare how much the
> allocator was able to actually find.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/mballoc.c | 3 +++
> 2 files changed, 4 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4ba2c95915eb..d8fa01e54e81 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1636,6 +1636,7 @@ struct ext4_sb_info {
> atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS]; /* total extents scanned */
> atomic_t s_bal_groups_scanned; /* number of groups scanned */
> atomic_t s_bal_goals; /* goal hits */
> + atomic_t s_bal_len_goals; /* len goal hits */
> atomic_t s_bal_breaks; /* too long searches */
> atomic_t s_bal_2orders; /* 2^order hits */
> atomic_t s_bal_cr0_bad_suggestions;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 07a50a13751c..c4ab8f412d32 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2930,6 +2930,7 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> atomic64_read(&sbi->s_bal_cX_failed[CR3]));
> seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
> seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
> + seq_printf(seq, "\t\tlen_goal_hits: %u\n", atomic_read(&sbi->s_bal_len_goals));
> seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
> seq_printf(seq, "\t\tbreaks: %u\n", atomic_read(&sbi->s_bal_breaks));
> seq_printf(seq, "\t\tlost: %u\n", atomic_read(&sbi->s_mb_lost_chunks));
> @@ -4233,6 +4234,8 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
> if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
> ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
> atomic_inc(&sbi->s_bal_goals);
> + if (ac->ac_f_ex.fe_len == ac->ac_g_ex.fe_len)
> + atomic_inc(&sbi->s_bal_len_goals);
> if (ac->ac_found > sbi->s_mb_max_to_scan)
> atomic_inc(&sbi->s_bal_breaks);
> }
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 27-01-23 18:07:34, Ojaswin Mujoo wrote:
> When we are inside ext4_mb_complex_scan_group() in CR1, we can be sure
> that this group has atleast 1 big enough continuous free extent to satisfy
> our request because (free / fragments) > goal length.
>
> Hence, instead of wasting time looping over smaller free extents, only
> try to consider the free extent if we are sure that it has enough
> continuous free space to satisfy goal length. This is particularly
> useful when scanning highly fragmented BGs in CR1 as, without this
> patch, the allocator might stop scanning early before reaching the big
> enough free extent (due to ac_found > mb_max_to_scan), which causes us to
> unnecessarily trim the request.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/mballoc.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index c4ab8f412d32..14529d2fe65f 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2279,7 +2279,7 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
> struct super_block *sb = ac->ac_sb;
> void *bitmap = e4b->bd_bitmap;
> struct ext4_free_extent ex;
> - int i;
> + int i, j, freelen;
> int free;
>
> free = e4b->bd_info->bb_free;
> @@ -2306,6 +2306,23 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
> break;
> }
>
> + if (ac->ac_criteria < CR2) {
> + /*
> + * In CR1, we are sure that this group will
> + * have a large enough continuous free extent, so skip
> + * over the smaller free extents
> + */
> + j = mb_find_next_bit(bitmap,
> + EXT4_CLUSTERS_PER_GROUP(sb), i);
> + freelen = j - i;
> +
> + if (freelen < ac->ac_g_ex.fe_len) {
> + i = j;
> + free -= freelen;
> + continue;
> + }
> + }
> +
> mb_find_extent(e4b, i, ac->ac_g_ex.fe_len, &ex);
> if (WARN_ON(ex.fe_len <= 0))
> break;
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
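(Editorial sketch, for illustration only: the CR1 skip in the patch above can be mimicked outside the kernel with toy bitmap helpers. The function names here are stand-ins for mb_find_next_bit() and the surrounding scan logic, not the kernel API.)

```c
#include <stddef.h>

/* first index >= i whose bit is set (a used block), scanning an
 * LSB-first byte bitmap; a toy stand-in for mb_find_next_bit() */
static size_t find_next_used(const unsigned char *bm, size_t nbits, size_t i)
{
	while (i < nbits && !((bm[i / 8] >> (i % 8)) & 1))
		i++;
	return i;
}

/* the CR1 fast path: measure the free run starting at i and only
 * report it as a candidate when it can hold the whole goal length,
 * so smaller runs are skipped without any per-extent bookkeeping */
static int run_fits_goal(const unsigned char *bm, size_t nbits,
			 size_t i, size_t goal_len)
{
	return find_next_used(bm, nbits, i) - i >= goal_len;
}
```

With a bitmap byte 0xF0 (bits 0-3 free, bits 4-7 used), the free run at bit 0 has length 4, so a goal of 4 fits while a goal of 5 forces the scan to move on, mirroring the `freelen < ac->ac_g_ex.fe_len` check.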
On Fri 27-01-23 18:07:35, Ojaswin Mujoo wrote:
> Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
> BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
> As a consequence, we end up not initializing the buddy structures and CR0/1
> lists for these BGs, even though it can be done without any disk IO
> overhead. Hence, don't skip such BGs during prefetch and prefetch_fini.
>
> This improves the accuracy of CR0/1 allocation: earlier, essentially
> empty BLOCK_UNINIT groups could be ignored by CR0/1 because their buddy was
> not initialized, leading to slower CR2 allocations. With this patch, CR0/1
> will be able to discover these groups as well, thus improving performance.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
The patch looks good. I just somewhat wonder - this change may result in
uninitialized groups being initialized and used earlier (previously we'd
rather search in other already initialized groups) which may spread
allocations more. But I suppose that's fine and uninit groups are not
really a feature meant to limit fragmentation and as the filesystem ages
the differences should be minimal. So feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/mballoc.c | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 14529d2fe65f..48726a831264 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2557,9 +2557,7 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
> */
> if (!EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
> EXT4_MB_GRP_NEED_INIT(grp) &&
> - ext4_free_group_clusters(sb, gdp) > 0 &&
> - !(ext4_has_group_desc_csum(sb) &&
> - (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
> +	    ext4_free_group_clusters(sb, gdp) > 0) {
> bh = ext4_read_block_bitmap_nowait(sb, group, true);
> if (bh && !IS_ERR(bh)) {
> if (!buffer_uptodate(bh) && cnt)
> @@ -2600,9 +2598,7 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
> grp = ext4_get_group_info(sb, group);
>
> if (EXT4_MB_GRP_NEED_INIT(grp) &&
> - ext4_free_group_clusters(sb, gdp) > 0 &&
> - !(ext4_has_group_desc_csum(sb) &&
> - (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
> + ext4_free_group_clusters(sb, gdp) > 0) {
> if (ext4_mb_init_group(sb, group, GFP_NOFS))
> break;
> }
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 27-01-23 18:07:36, Ojaswin Mujoo wrote:
> Before this patch, the call stack in ext4_run_li_request is as follows:
>
> /*
> * nr = no. of BGs we want to fetch (=s_mb_prefetch)
> * prefetch_ios = no. of BGs not uptodate after
> * ext4_read_block_bitmap_nowait()
> */
> next_group = ext4_mb_prefetch(sb, group, nr, prefetch_ios);
> ext4_mb_prefetch_fini(sb, next_group, prefetch_ios);
>
> ext4_mb_prefetch_fini() will only try to initialize buddies for BGs in
> range [next_group - prefetch_ios, next_group). This is incorrect since
> sometimes prefetch_ios < nr, which causes ext4_mb_prefetch_fini() to
> incorrectly ignore some of the BGs that might need initialization. This
> issue is more notable now with the previous patch enabling "fetching" of
> BLOCK_UNINIT BGs which are marked buffer_uptodate by default.
>
> Fix this by passing nr to ext4_mb_prefetch_fini() instead of
> prefetch_ios so that it considers the right range of groups.
>
> Similarly, make sure we don't pass nr=0 to ext4_mb_prefetch_fini() in
> ext4_mb_regular_allocator() since we might have prefetched BLOCK_UNINIT
> groups that would need buddy initialization.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/mballoc.c | 4 ----
> fs/ext4/super.c | 11 ++++-------
> 2 files changed, 4 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 48726a831264..410c9636907b 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2702,8 +2702,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> if ((prefetch_grp == group) &&
> (cr > CR1 ||
> prefetch_ios < sbi->s_mb_prefetch_limit)) {
> - unsigned int curr_ios = prefetch_ios;
> -
> nr = sbi->s_mb_prefetch;
> if (ext4_has_feature_flex_bg(sb)) {
> nr = 1 << sbi->s_log_groups_per_flex;
> @@ -2712,8 +2710,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> }
> prefetch_grp = ext4_mb_prefetch(sb, group,
> nr, &prefetch_ios);
> - if (prefetch_ios == curr_ios)
> - nr = 0;
> }
>
> /* This now checks without needing the buddy page */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 72ead3b56706..9dbb09cfc8f7 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3636,16 +3636,13 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
> ext4_group_t group = elr->lr_next_group;
> unsigned int prefetch_ios = 0;
> int ret = 0;
> + int nr = EXT4_SB(sb)->s_mb_prefetch;
> u64 start_time;
>
> if (elr->lr_mode == EXT4_LI_MODE_PREFETCH_BBITMAP) {
> - elr->lr_next_group = ext4_mb_prefetch(sb, group,
> - EXT4_SB(sb)->s_mb_prefetch, &prefetch_ios);
> - if (prefetch_ios)
> - ext4_mb_prefetch_fini(sb, elr->lr_next_group,
> - prefetch_ios);
> - trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group,
> - prefetch_ios);
> + elr->lr_next_group = ext4_mb_prefetch(sb, group, nr, &prefetch_ios);
> + ext4_mb_prefetch_fini(sb, elr->lr_next_group, nr);
> + trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group, nr);
> if (group >= elr->lr_next_group) {
> ret = 1;
> if (elr->lr_first_not_zeroed != ngroups &&
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
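(Editorial sketch, for illustration only: the range arithmetic fixed by the patch above, as hypothetical userspace helpers. ext4_mb_prefetch_fini() walks the "count" groups ending just before next_group, so the caller must pass the number of groups prefetched, not the number of IOs issued.)

```c
/* ext4_mb_prefetch_fini() initializes buddies for the "count" groups
 * in [next_group - count, next_group); passing prefetch_ios (IOs
 * actually issued) instead of nr (groups prefetched) shrinks that
 * window whenever some bitmaps were already uptodate or BLOCK_UNINIT */
static unsigned int fini_first_group(unsigned int next_group,
				     unsigned int count)
{
	return next_group - count;
}

/* groups whose buddy init was skipped by the old code when
 * prefetch_ios < nr */
static unsigned int groups_missed(unsigned int nr, unsigned int prefetch_ios)
{
	return nr > prefetch_ios ? nr - prefetch_ios : 0;
}
```

For example, prefetching nr=32 groups of which only 8 needed IO leaves a window of 8 groups starting at next_group - 8, missing the other 24, which is exactly what passing nr instead fixes.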
On Fri 27-01-23 18:07:37, Ojaswin Mujoo wrote:
> Make the logic of searching the average fragment list of a given order
> reusable by abstracting it out into a different function. This will also
> avoid code duplication in upcoming patches.
>
> No functional changes.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/mballoc.c | 51 ++++++++++++++++++++++++++++++-----------------
> 1 file changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 410c9636907b..1ce1174aea52 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -902,6 +902,37 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> }
> }
>
> +/*
> + * Find a suitable group of given order from the average fragments list.
> + */
> +static struct ext4_group_info *
> +ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int order)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct list_head *frag_list = &sbi->s_mb_avg_fragment_size[order];
> + rwlock_t *frag_list_lock = &sbi->s_mb_avg_fragment_size_locks[order];
> + struct ext4_group_info *grp = NULL, *iter;
> + enum criteria cr = ac->ac_criteria;
> +
> + if (list_empty(frag_list))
> + return NULL;
> + read_lock(frag_list_lock);
> + if (list_empty(frag_list)) {
> + read_unlock(frag_list_lock);
> + return NULL;
> + }
> + list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
> + if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
> + grp = iter;
> + break;
> + }
> + }
> + read_unlock(frag_list_lock);
> + return grp;
> +}
> +
> /*
> * Choose next group by traversing average fragment size list of suitable
> * order. Updates *new_cr if cr level needs an update.
> @@ -910,7 +941,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> {
> struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> - struct ext4_group_info *grp = NULL, *iter;
> + struct ext4_group_info *grp = NULL;
> int i;
>
> if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> @@ -920,23 +951,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>
> for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
> i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> - if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
> - continue;
> - read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
> - if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
> - read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
> - continue;
> - }
> - list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
> - bb_avg_fragment_size_node) {
> - if (sbi->s_mb_stats)
> - atomic64_inc(&sbi->s_bal_cX_groups_considered[CR1]);
> - if (likely(ext4_mb_good_group(ac, iter->bb_group, CR1))) {
> - grp = iter;
> - break;
> - }
> - }
> - read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
> + grp = ext4_mb_find_good_group_avg_frag_lists(ac, i);
> if (grp)
> break;
> }
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 27-01-23 18:07:38, Ojaswin Mujoo wrote:
> CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
> fact that we couldn't find a group in CR1 suggests that it would be
> difficult to find a contiguous extent to completely satisfy our
> allocations. So before falling to the slower CR2, in CR1.5 we
> proactively trim the preallocations so we can find a group with
> (free / fragments) big enough. This speeds up our allocation at the
> cost of slightly reduced preallocation.
>
> The patch also adds a new sysfs tunable:
>
> * /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order
>
> This controls how much CR1.5 can trim a request before falling to CR2.
> For example, for a request of order 7 and max trim order 2, CR1.5 can
> trim this up to order 5.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
The idea looks good. Couple of questions below...
> +/*
> + * We couldn't find a group in CR1 so try to find the highest free fragment
> + * order we have and proactively trim the goal request length to that order to
> + * find a suitable group faster.
> + *
> + * This optimizes allocation speed at the cost of slightly reduced
> + * preallocations. However, we make sure that we don't trim the request too
> + * much and fall to CR2 in that case.
> + */
> +static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
> + enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_group_info *grp = NULL;
> + int i, order, min_order;
> +
> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
> + if (sbi->s_mb_stats)
> + atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
> + }
> +
> + /*
> + * mb_avg_fragment_size_order() returns order in a way that makes
> + * retrieving back the length using (1 << order) inaccurate. Hence, use
> + * fls() instead since we need to know the actual length while modifying
> + * goal length.
> + */
> + order = fls(ac->ac_g_ex.fe_len);
> + min_order = order - sbi->s_mb_cr1_5_max_trim_order;
Given we still require the allocation contains at least originally
requested blocks, is it ever the case that goal size would be 8 times
larger than original alloc size? Otherwise the
sbi->s_mb_cr1_5_max_trim_order logic seems a bit pointless...
> + if (min_order < 0)
> + min_order = 0;
Perhaps add:
if (1 << min_order < ac->ac_o_ex.fe_len)
min_order = fls(ac->ac_o_ex.fe_len) + 1;
and then you can drop the condition from the loop below...
> +
> + for (i = order; i >= min_order; i--) {
> + if (ac->ac_o_ex.fe_len <= (1 << i)) {
> + /*
> + * Scale down goal len to make sure we find something
> + * in the free fragments list. Basically, reduce
> + * preallocations.
> + */
> + ac->ac_g_ex.fe_len = 1 << i;
When scaling down the size with sbi->s_stripe > 1, it would be better to
choose multiple of sbi->s_stripe and not power of two. But our stripe
support is fairly weak anyway (e.g. initial goal size does not reflect it
at all AFAICT) so probably we don't care here either.
> + } else {
> + break;
> + }
> +
> + grp = ext4_mb_find_good_group_avg_frag_lists(ac,
> + mb_avg_fragment_size_order(ac->ac_sb,
> + ac->ac_g_ex.fe_len));
> + if (grp)
> + break;
> + }
> +
> + if (grp) {
> + *group = grp->bb_group;
> + ac->ac_flags |= EXT4_MB_CR1_5_OPTIMIZED;
> + } else {
> + /* Reset goal length to original goal length before falling into CR2 */
> + ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
> *new_cr = CR2;
> }
> }
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Mar 09, 2023 at 01:11:22PM +0100, Jan Kara wrote:
> On Fri 27-01-23 18:07:31, Ojaswin Mujoo wrote:
> > Convert criteria to be an enum so it easier to maintain. This change
> > also makes it easier to insert new criterias in the future.
> >
> > Signed-off-by: Ojaswin Mujoo <[email protected]>
> > Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
>
> Just two small comments below:
Hi Jan,
Thanks for the review.
>
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index b8b00457da8d..6037b8e0af86 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -126,6 +126,14 @@ enum SHIFT_DIRECTION {
> > SHIFT_RIGHT,
> > };
> >
> > +/*
> > + * Number of criterias defined. For each criteria, mballoc has slightly
> > + * different way of finding the required blocks nad usually, higher the
> ^^^ and
>
> > + * criteria the slower the allocation. We start at lower criterias and keep
> > + * falling back to higher ones if we are not able to find any blocks.
> > + */
> > +#define EXT4_MB_NUM_CRS 4
> > +
>
> So defining this in a different header than the enum itself is fragile. I
> understand you need it in ext4_sb_info declaration so probably I'd move the
> enum declaration to ext4.h. Alternatively I suppose we could move a lot of
Got it, I'll try to keep them in the same file.
> mballoc stuff out of ext4_sb_info into a separate struct because there's a
> lot of it. But that would be much larger undertaking.
Right, we did notice that as well, but as you said, that's out of scope
of this patchset.
>
> Also when going for symbolic allocator scan names maybe we could actually
> make names sensible instead of CR[0-4]? Perhaps like CR_ORDER2_ALIGNED,
CR_BEST_LENGTH_FAST, CR_BEST_LENGTH_ALL, CR_ANY_FREE. And probably we could
> deal with ordered comparisons like in:
I like this idea, it should make the code a bit easier to
understand. However just wondering if I should do it as a part of this
series or a separate patch since we'll be touching code all around and
I don't want to confuse people with the noise :)
>
> if (cr < 2 &&
> (!sbi->s_log_groups_per_flex ||
> ((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) &
> !(ext4_has_group_desc_csum(sb) &&
> (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))))
> return 0;
>
> to declare CR_FAST_SCAN = 2, or something like that. What do you think?
About this, won't it be better to just use something like
cr < CR_BEST_LENGTH_ALL
instead of defining a new CR_FAST_SCAN = 2.
The only concern is that if we add a new "fast" CR (say between
CR_BEST_LENGTH_FAST and CR_BEST_LENGTH_ALL) then we'll need to make
sure we also update CR_FAST_SCAN to 3 which is easy to miss.
Regards,
Ojaswin
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Thu, Mar 09, 2023 at 03:14:22PM +0100, Jan Kara wrote:
> On Fri 27-01-23 18:07:35, Ojaswin Mujoo wrote:
> > Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
> > BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
> > As a consequence, we end not initializing the buddy structures and CR0/1
> > lists for these BGs, even though it can be done without any disk IO
> > overhead. Hence, don't skip such BGs during prefetch and prefetch_fini.
> >
> > This improves the accuracy of CR0/1 allocation as earlier, we could have
> > essentially empty BLOCK_UNINIT groups being ignored by CR0/1 due to their buddy
> > not being initialized, leading to slower CR2 allocations. With this patch CR0/1
> > will be able to discover these groups as well, thus improving performance.
> >
> > Signed-off-by: Ojaswin Mujoo <[email protected]>
> > Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
>
> The patch looks good. I just somewhat wonder - this change may result in
> uninitialized groups being initialized and used earlier (previously we'd
> rather search in other already initialized groups) which may spread
> allocations more. But I suppose that's fine and uninit groups are not
> really a feature meant to limit fragmentation and as the filesystem ages
> the differences should be minimal. So feel free to add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> Honza
Thanks for the review. As for the allocation spread, I agree that it
should be something our goal determination logic should take care of
rather than limiting the BGs available to the allocator.
Another point I wanted to discuss wrt this patch series was why were the
BLOCK_UNINIT groups not being prefetched earlier. One point I can think
of is that this might lead to memory pressure when we have too many
empty BGs in a very large (say terabytes) disk.
But I'd still like to know if there's some history behind not
prefetching block uninit.
Cc'ing Andreas as well to check if they came across anything in Lustre
in the past.
>
> > ---
> > fs/ext4/mballoc.c | 8 ++------
> > 1 file changed, 2 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 14529d2fe65f..48726a831264 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -2557,9 +2557,7 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
> > */
> > if (!EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
> > EXT4_MB_GRP_NEED_INIT(grp) &&
> > - ext4_free_group_clusters(sb, gdp) > 0 &&
> > - !(ext4_has_group_desc_csum(sb) &&
> > - (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
> > +	    ext4_free_group_clusters(sb, gdp) > 0) {
> > bh = ext4_read_block_bitmap_nowait(sb, group, true);
> > if (bh && !IS_ERR(bh)) {
> > if (!buffer_uptodate(bh) && cnt)
> > @@ -2600,9 +2598,7 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
> > grp = ext4_get_group_info(sb, group);
> >
> > if (EXT4_MB_GRP_NEED_INIT(grp) &&
> > - ext4_free_group_clusters(sb, gdp) > 0 &&
> > - !(ext4_has_group_desc_csum(sb) &&
> > - (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
> > + ext4_free_group_clusters(sb, gdp) > 0) {
> > if (ext4_mb_init_group(sb, group, GFP_NOFS))
> > break;
> > }
> > --
> > 2.31.1
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Thu, Mar 09, 2023 at 04:06:49PM +0100, Jan Kara wrote:
> On Fri 27-01-23 18:07:38, Ojaswin Mujoo wrote:
> > CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
> > fact that we couldn't find a group in CR1 suggests that it would be
> > difficult to find a contiguous extent to completely satisfy our
> > allocations. So before falling to the slower CR2, in CR1.5 we
> > proactively trim the preallocations so we can find a group with
> > (free / fragments) big enough. This speeds up our allocation at the
> > cost of slightly reduced preallocation.
> >
> > The patch also adds a new sysfs tunable:
> >
> > * /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order
> >
> > This controls how much CR1.5 can trim a request before falling to CR2.
> > For example, for a request of order 7 and max trim order 2, CR1.5 can
> > trim this up to order 5.
> >
> > Signed-off-by: Ojaswin Mujoo <[email protected]>
> > Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
>
> The idea looks good. Couple of questions below...
>
> > +/*
> > + * We couldn't find a group in CR1 so try to find the highest free fragment
> > + * order we have and proactively trim the goal request length to that order to
> > + * find a suitable group faster.
> > + *
> > + * This optimizes allocation speed at the cost of slightly reduced
> > + * preallocations. However, we make sure that we don't trim the request too
> > + * much and fall to CR2 in that case.
> > + */
> > +static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
> > + enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + struct ext4_group_info *grp = NULL;
> > + int i, order, min_order;
> > +
> > + if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
> > + if (sbi->s_mb_stats)
> > + atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
> > + }
> > +
> > + /*
> > + * mb_avg_fragment_size_order() returns order in a way that makes
> > + * retrieving back the length using (1 << order) inaccurate. Hence, use
> > + * fls() instead since we need to know the actual length while modifying
> > + * goal length.
> > + */
> > + order = fls(ac->ac_g_ex.fe_len);
> > + min_order = order - sbi->s_mb_cr1_5_max_trim_order;
>
> Given we still require the allocation contains at least originally
> requested blocks, is it ever the case that goal size would be 8 times
> larger than original alloc size? Otherwise the
> sbi->s_mb_cr1_5_max_trim_order logic seems a bit pointless...
Yes that is possible. In ext4_mb_normalize_request, for original request len <
8MB we actually determine the goal length based on the length of the
file (i_size) rather than the length of the original request. For eg:
if (size <= 16 * 1024) {
size = 16 * 1024;
} else if (size <= 32 * 1024) {
size = 32 * 1024;
} else if (size <= 64 * 1024) {
size = 64 * 1024;
and this goes all the way up to size = 8MB. So for a file larger than
8MB, even if the original request is just 1 block (4KB), the goal len
would be 2048 blocks (8MB). That's why we decided to add a tunable depending
on the user's preference.
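(Editorial sketch, for illustration only: the i_size ladder described above, simplified into one userspace function. The real ext4_mb_normalize_request() also aligns larger files to 8MB chunks and accounts for the logical offset, which is omitted here.)

```c
#include <stddef.h>

/* goal size chosen from file size (bytes): everything up to 8MB rounds
 * up to the next power-of-two step, and anything larger still gets an
 * 8MB goal -- which is why a 1-block write to a big file asks for 8MB */
static unsigned long normalize_goal_bytes(unsigned long size)
{
	static const unsigned long steps[] = {
		16 << 10, 32 << 10, 64 << 10, 128 << 10, 256 << 10,
		512 << 10, 1 << 20, 2 << 20, 4 << 20, 8 << 20,
	};
	size_t i;

	for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++)
		if (size <= steps[i])
			return steps[i];
	return 8 << 20; /* simplified: kernel aligns larger files to 8MB chunks */
}
```

So a 4KB append to a 20MB file yields an 8MB goal, which is the situation CR1.5's trimming is meant to rescue on fragmented filesystems.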
>
> > + if (min_order < 0)
> > + min_order = 0;
>
> Perhaps add:
>
> if (1 << min_order < ac->ac_o_ex.fe_len)
> min_order = fls(ac->ac_o_ex.fe_len) + 1;
>
> and then you can drop the condition from the loop below...
That looks better, I will do it. Thanks!
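(Editorial sketch, for illustration only: the agreed-on trim bounds as a userspace function. fls_() is a stand-in for the kernel's fls() built on the GCC/Clang builtin, and the second clamp is the one Jan suggested above.)

```c
#include <limits.h>

/* 1-based index of the highest set bit, 0 for 0 -- kernel fls() */
static int fls_(unsigned int x)
{
	return x ? (int)(sizeof(x) * CHAR_BIT) - __builtin_clz(x) : 0;
}

/* lowest order CR1.5 may trim the goal down to: at most
 * max_trim_order steps below the goal's order, never negative, and
 * never so low that 1 << min_order cannot hold the original request */
static int cr1_5_min_order(int goal_len, int orig_len, int max_trim_order)
{
	int min_order = fls_(goal_len) - max_trim_order;

	if (min_order < 0)
		min_order = 0;
	if ((1 << min_order) < orig_len)
		min_order = fls_(orig_len) + 1;
	return min_order;
}
```

For a 2048-block goal (order 12 via fls) with max_trim_order = 3, the floor is order 9; with a 100-block original request and an unlimited trim budget, the original-length clamp raises the floor back to order 8.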
>
> > +
> > + for (i = order; i >= min_order; i--) {
> > + if (ac->ac_o_ex.fe_len <= (1 << i)) {
> > + /*
> > + * Scale down goal len to make sure we find something
> > + * in the free fragments list. Basically, reduce
> > + * preallocations.
> > + */
> > + ac->ac_g_ex.fe_len = 1 << i;
>
> When scaling down the size with sbi->s_stripe > 1, it would be better to
> choose multiple of sbi->s_stripe and not power of two. But our stripe
> support is fairly weak anyway (e.g. initial goal size does not reflect it
> at all AFAICT) so probably we don't care here either.
Oh right, I missed that. I'll make the change as it doesn't hurt to have
it here.
Thanks for the review!
regards,
ojaswin
>
> > + } else {
> > + break;
> > + }
> > +
> > + grp = ext4_mb_find_good_group_avg_frag_lists(ac,
> > + mb_avg_fragment_size_order(ac->ac_sb,
> > + ac->ac_g_ex.fe_len));
> > + if (grp)
> > + break;
> > + }
> > +
> > + if (grp) {
> > + *group = grp->bb_group;
> > + ac->ac_flags |= EXT4_MB_CR1_5_OPTIMIZED;
> > + } else {
> > + /* Reset goal length to original goal length before falling into CR2 */
> > + ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
> > *new_cr = CR2;
> > }
> > }
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Thu, Mar 23, 2023 at 11:57:10AM +0100, Jan Kara wrote:
> On Fri 17-03-23 16:25:04, Ojaswin Mujoo wrote:
> > On Thu, Mar 09, 2023 at 03:14:22PM +0100, Jan Kara wrote:
> > > On Fri 27-01-23 18:07:35, Ojaswin Mujoo wrote:
> > > > Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
> > > > BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
> > > > As a consequence, we end up not initializing the buddy structures and CR0/1
> > > > lists for these BGs, even though it can be done without any disk IO
> > > > overhead. Hence, don't skip such BGs during prefetch and prefetch_fini.
> > > >
> > > > This improves the accuracy of CR0/1 allocation as earlier, we could have
> > > > essentially empty BLOCK_UNINIT groups being ignored by CR0/1 due to their buddy
> > > > not being initialized, leading to slower CR2 allocations. With this patch CR0/1
> > > > will be able to discover these groups as well, thus improving performance.
> > > >
> > > > Signed-off-by: Ojaswin Mujoo <[email protected]>
> > > > Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
> > >
> > > The patch looks good. I just somewhat wonder - this change may result in
> > > uninitialized groups being initialized and used earlier (previously we'd
> > > rather search in other already initialized groups) which may spread
> > > allocations more. But I suppose that's fine and uninit groups are not
> > > really a feature meant to limit fragmentation and as the filesystem ages
> > > the differences should be minimal. So feel free to add:
> > >
> > > Reviewed-by: Jan Kara <[email protected]>
> > >
> > > Honza
> > Thanks for the review. As for the allocation spread, I agree that it
> > should be something our goal determination logic should take care of
> > rather than limiting the BGs available to the allocator.
> >
> > Another point I wanted to discuss wrt this patch series was why were the
> > BLOCK_UNINIT groups not being prefetched earlier. One point I can think
> > of is that this might lead to memory pressure when we have too many
> > empty BGs in a very large (say terabytes) disk.
> >
> But I'd still like to know if there's some history behind not
> > prefetching block uninit.
>
> Hum, I don't remember anything. Maybe Ted will. You can ask him today on a
> call.
Unfortunately, I couldn't join it last time :) I'll check with him on the
upcoming Thurs.
Regards,
ojaswin
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Fri, Mar 17, 2023 at 04:25:04PM +0530, Ojaswin Mujoo wrote:
> > > This improves the accuracy of CR0/1 allocation as earlier, we could have
> > > essentially empty BLOCK_UNINIT groups being ignored by CR0/1 due to their buddy
> > > not being initialized, leading to slower CR2 allocations. With this patch CR0/1
> > > will be able to discover these groups as well, thus improving performance.
> >
> > The patch looks good. I just somewhat wonder - this change may result in
> > uninitialized groups being initialized and used earlier (previously we'd
> > rather search in other already initialized groups) which may spread
> > allocations more. But I suppose that's fine and uninit groups are not
> > really a feature meant to limit fragmentation and as the filesystem ages
> > the differences should be minimal. So feel free to add:
>
> Another point I wanted to discuss wrt this patch series was why were the
> BLOCK_UNINIT groups not being prefetched earlier. One point I can think
> of is that this might lead to memory pressure when we have too many
> empty BGs in a very large (say terabytes) disk.
Originally the prefetch logic was simply something to optimize I/O ---
that is, normally, all of the block bitmaps for a flex_bg are
contiguous, so why not just read them all in a single I/O which is
issued all at once, instead of doing them as separate 4k reads.
Skipping block groups that hadn't yet been prefetched was something
which was added later, in order to improve performance of the
allocator for freshly mounted file systems where the prefetch hadn't
yet had a chance to pull in block bitmaps; the problem was that if the
block groups hadn't been prefetched yet, then the cr0 scan would fetch
them, and if you have a storage device where blocks with monotonically
increasing LBA numbers aren't necessarily stored adjacently on disk
(for example, on a dm-thin volume, but if one were to do an experiment
on certain emulated block devices on certain hyperscalar cloud
environments, one might find a similar performance profile), resulting
in a cr0 scan potentially issuing a series of 16 sequential 4k I/O's,
that could be substantially worse from a performance standpoint than
doing a single sequential 64k I/O.
When this change was made, the focus was on *initialized* bitmaps
taking a long time if they were issued as individual sequential 4k
I/O's; the fix was to skip scanning them initially, since the hope was
that the prefetch would pull them in fairly quickly, and a few bad
allocations when the file system was freshly mounted was an acceptable
tradeoff.
But prefetching BLOCK_UNINIT groups makes sense; that
should fix the problem that you've identified (at least for
BLOCK_UNINIT groups; for initialized block bitmaps, we'll still have
less optimal allocation patterns until we've managed to prefetch those
block groups).
Cheers,
- Ted
On Sat, Mar 25, 2023 at 08:12:36PM +0530, Ojaswin Mujoo wrote:
> On Thu, Mar 23, 2023 at 11:55:37AM +0100, Jan Kara wrote:
> > On Fri 17-03-23 15:56:46, Ojaswin Mujoo wrote:
> > > On Thu, Mar 09, 2023 at 01:11:22PM +0100, Jan Kara wrote:
> > > > Also when going for symbolic allocator scan names maybe we could actually
> > > > make names sensible instead of CR[0-4]? Perhaps like CR_ORDER2_ALIGNED,
> > > > CR_BEST_LENGTH_FAST, CR_BEST_LENGTH_ALL, CR_ANY_FREE. And probably we could
> > > > deal with ordered comparisons like in:
> > > I like this idea, it should make the code a bit easier to
> > > understand. However just wondering if I should do it as a part of this
> > > series or a separate patch since we'll be touching code all around and
> > > I don't want to confuse people with the noise :)
> >
> > I guess a mechanical rename should not be really confusing. It just has to
> > be a separate patch.
> Alright, got it.
> >
> > > >
> > > > if (cr < 2 &&
> > > > (!sbi->s_log_groups_per_flex ||
> > > > ((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) &
> > > > !(ext4_has_group_desc_csum(sb) &&
> > > > (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))))
> > > > return 0;
> > > >
> > > > to declare CR_FAST_SCAN = 2, or something like that. What do you think?
> > > About this, wont it be better to just use something like
> > >
> > > cr < CR_BEST_LENGTH_ALL
> > >
> > > instead of defining a new CR_FAST_SCAN = 2.
> >
> > Yeah, that works as well.
> >
> > > The only concern is that if we add a new "fast" CR (say between
> > > CR_BEST_LENGTH_FAST and CR_BEST_LENGTH_ALL) then we'll need to make
> > > sure we also update CR_FAST_SCAN to 3 which is easy to miss.
> >
> > Well, you have that problem with any naming scheme (and even with numbers).
> > So as long as names are all defined together, there's reasonable chance
> > you'll remember to verify the limits still hold :)
> haha that's true. Anyways, I'll try a few things and see what looks
> good. Thanks for the suggestions.
Hey Jan,
So I was playing around with this and I prepared a patch to convert CR
numbers to symbolic names, and it looks good as far as things like these
are concerned:
if (cr < CR_POWER2_ALIGNED)
...
However, there's one problem: this numeric naming scheme is used in
several places like struct member names, function names, traces and
comments. The issue is that replacing it everywhere makes some of
the names very long, for example:
atomic_read(&sbi->s_bal_cr0_bad_suggestions));
becomes:
atomic_read(&sbi->s_bal_cr_power2_aligned_bad_suggestions));
And this is kind of making the code look messy at a lot of places. So
right now there are a few options we can consider:
1. Use symbolic names everywhere at the cost of readability
2. Keep function names/members as is but change the criteria enums to symbolic
names. This again is not ideal as it makes things ambiguous.
3. Keep the enums as is, i.e. CR0, CR1, etc., and add documentation in the enum
declaration on what each enum means. (we can still use CR_FAST_SCAN).
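(Editorial sketch, for illustration only: what the symbolic names floated earlier in this thread could look like, together with the ordered comparison in place of a separate CR_FAST_SCAN constant. These names are the ones proposed in the discussion, not final.)

```c
/* symbolic names for the mballoc allocation criteria, in scan order */
enum criteria {
	CR_ORDER2_ALIGNED,	/* CR0: power-of-2 aligned scan */
	CR_BEST_LENGTH_FAST,	/* CR1: average-fragment-size lists */
	CR_BEST_LENGTH_ALL,	/* CR2: full linear scan */
	CR_ANY_FREE,		/* CR3: take any free blocks */
	EXT4_MB_NUM_CRS,
};

/* compare against the first slow criteria instead of keeping a
 * CR_FAST_SCAN constant that could silently go stale when a new fast
 * criteria is inserted between the existing ones */
static int is_fast_scan(enum criteria cr)
{
	return cr < CR_BEST_LENGTH_ALL;
}
```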
Let me know your thoughts on this, or in case you'd like to look at the
patch I can attach that as well.
Regards,
ojaswin
>
> Regards,
> ojaswin
> >
> > Honza
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR