This patch series improves cr 0 and cr 1 passes of the allocator
signficantly. Currently, at cr 0 and 1, we perform linear lookups to
find the matching groups. That's very inefficient for large file
systems where there are millions of block groups. At cr 0, we only
care about the groups that have the largest free order >= the
request's order and at cr 1 we only care about groups where average
fragment size > the request size. so, this patchset introduces new
data structures that allow us to perform cr 0 lookup in constant time
and cr 1 lookup in log (number of groups) time instead of linear.
For cr 0, we add a list for each order and all the groups are enqueued
to the appropriate list based on the largest free order in its buddy
bitmap. This allows us to lookup a match at cr 0 in constant time.
For cr 1, we add a new rb tree of groups sorted by largest fragment
size. This allows us to lookup a match for cr 1 in log (num groups)
time.
These optimizations can be enabled by passing "mb_optimize_scan" mount
option.
These changes may result in allocations to be spread across the block
device. While that would not matter some block devices (such as flash)
it may be a cause of concern for other block devices that benefit from
storing related content togetther such as disk. However, it can be
argued that in high fragmentation scenrio, especially for large disks,
it's still worth optimizing the scanning since in such cases, we get
cpu bound on group scanning instead of getting IO bound. Perhaps, in
future, we could dynamically turn this new optimization on based on
fragmentation levels for such devices.
Verified that there are no regressions in smoke tests (-g quick -c 4k).
Also, to demonstrate the effectiveness for the patch series, following
experiment was performed:
Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. Following command was run consecutively for 3
times:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
The patch [3/6] "ext4: add mballoc stats proc file" is a modified
version of the patch originally written by Artem Blagodarenko
([email protected]). With that patch, I ran following
command with and without optimizations.
dd if=/dev/zero of=/mnt/file bs=2M count=2 conv=fsync
Without optimizations:
useless_c0_loops: 3
useless_c1_loops: 39
useless_c2_loops: 0
useless_c3_loops: 0
With optimizations:
useless_c0_loops: 0
useless_c1_loops: 0
useless_c2_loops: 0
useless_c3_loops: 0
This shows that CR0 and CR1 optimizations get rid of useless CR0 and
CR1 loops altogether thereby significantly reducing the number of
groups that get considered.
Changes from V4:
----------------
- Only minor fixes, no significant changes
Harshad Shirwadkar (6):
ext4: drop s_mb_bal_lock and convert protected fields to atomic
ext4: add ability to return parsed options from parse_options
ext4: add mballoc stats proc file
ext4: add MB_NUM_ORDERS macro
ext4: improve cr 0 / cr 1 group scanning
ext4: add proc files to monitor new structures
fs/ext4/ext4.h | 30 ++-
fs/ext4/mballoc.c | 572 +++++++++++++++++++++++++++++++++++++++++++---
fs/ext4/mballoc.h | 22 +-
fs/ext4/super.c | 79 +++++--
fs/ext4/sysfs.c | 6 +
5 files changed, 652 insertions(+), 57 deletions(-)
--
2.31.0.291.g576ba9dcdaf-goog
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fallback to cr 1
immediately.
At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, following experiment was performed:
Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. Following command was run consecutively for 3
times:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
---
fs/ext4/ext4.h | 19 ++-
fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/mballoc.h | 17 ++-
fs/ext4/super.c | 28 +++-
fs/ext4/sysfs.c | 2 +
5 files changed, 432 insertions(+), 15 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 85eeeba3bca3..5930c8cb5159 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
#define EXT4_MB_USE_RESERVED 0x2000
/* Do strict check for free blocks while retrying block allocation */
#define EXT4_MB_STRICT_CHECK 0x4000
-
+/* Large fragment size list lookup succeeded at least once for cr = 0 */
+#define EXT4_MB_CR0_OPTIMIZED 0x8000
+/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
+#define EXT4_MB_CR1_OPTIMIZED 0x00010000
struct ext4_allocation_request {
/* target inode for block we're allocating */
struct inode *inode;
@@ -1247,7 +1250,9 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
#define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
#define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
-
+#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
+ * scanning in mballoc
+ */
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
@@ -1527,9 +1532,14 @@ struct ext4_sb_info {
unsigned int s_mb_free_pending;
struct list_head s_freed_data_list; /* List of blocks to be freed
after commit completed */
+ struct rb_root s_mb_avg_fragment_size_root;
+ rwlock_t s_mb_rb_lock;
+ struct list_head *s_mb_largest_free_orders;
+ rwlock_t *s_mb_largest_free_orders_locks;
/* tunables */
unsigned long s_stripe;
+ unsigned int s_mb_linear_limit;
unsigned int s_mb_stream_request;
unsigned int s_mb_max_to_scan;
unsigned int s_mb_min_to_scan;
@@ -1553,6 +1563,8 @@ struct ext4_sb_info {
atomic_t s_bal_goals; /* goal hits */
atomic_t s_bal_breaks; /* too long searches */
atomic_t s_bal_2orders; /* 2^order hits */
+ atomic_t s_bal_cr0_bad_suggestions;
+ atomic_t s_bal_cr1_bad_suggestions;
atomic64_t s_bal_cX_groups_considered[4];
atomic64_t s_bal_cX_hits[4];
atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
@@ -3309,11 +3321,14 @@ struct ext4_group_info {
ext4_grpblk_t bb_free; /* total free blocks */
ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
+ ext4_group_t bb_group; /* Group number */
struct list_head bb_prealloc_list;
#ifdef DOUBLE_CHECK
void *bb_bitmap;
#endif
struct rw_semaphore alloc_sem;
+ struct rb_node bb_avg_fragment_size_rb;
+ struct list_head bb_largest_free_order_node;
ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
* regions, index is order.
* bb_counters[3] = 5 means
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 15127d815461..cbf9a89c0ef5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -127,11 +127,50 @@
* smallest multiple of the stripe value (sbi->s_stripe) which is
* greater than the default mb_group_prealloc.
*
+ * If "mb_optimize_scan" mount option is set, we maintain in memory group info
+ * structures in two data structures:
+ *
+ * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
+ *
+ * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
+ *
+ * This is an array of lists where the index in the array represents the
+ * largest free order in the buddy bitmap of the participating group infos of
+ * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
+ * number of buddy bitmap orders possible) number of lists. Group-infos are
+ * placed in appropriate lists.
+ *
+ * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
+ *
+ * Locking: sbi->s_mb_rb_lock (rwlock)
+ *
+ * This is a red black tree consisting of group infos and the tree is sorted
+ * by average fragment sizes (which is calculated as ext4_group_info->bb_free
+ * / ext4_group_info->bb_fragments).
+ *
+ * When "mb_optimize_scan" mount option is set, mballoc consults the above data
+ * structures to decide the order in which groups are to be traversed for
+ * fulfilling an allocation request.
+ *
+ * At CR = 0, we look for groups which have the largest_free_order >= the order
+ * of the request. We directly look at the largest free order list in the data
+ * structure (1) above where largest_free_order = order of the request. If that
+ * list is empty, we look at remaining list in the increasing order of
+ * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
+ *
+ * At CR = 1, we only consider groups where average fragment size > request
+ * size. So, we lookup a group which has average fragment size just above or
+ * equal to request size using our rb tree (data structure 2) in O(log N) time.
+ *
+ * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
+ * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
+ *
* The regular allocator (using the buddy cache) supports a few tunables.
*
* /sys/fs/ext4/<partition>/mb_min_to_scan
* /sys/fs/ext4/<partition>/mb_max_to_scan
* /sys/fs/ext4/<partition>/mb_order2_req
+ * /sys/fs/ext4/<partition>/mb_linear_limit
*
* The regular allocator uses buddy scan only if the request len is power of
* 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
@@ -149,6 +188,16 @@
* can be used for allocation. ext4_mb_good_group explains how the groups are
* checked.
*
+ * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
+ * get traversed linearly. That may result in subsequent allocations being not
+ * close to each other. And so, the underlying device may get filled up in a
+ * non-linear fashion. While that may not matter on non-rotational devices, for
+ * rotational devices that may result in higher seek times. "mb_linear_limit"
+ * tells mballoc how many groups mballoc should search linearly before
+ * performing consulting above data structures for more efficient lookups. For
+ * non rotational devices, this value defaults to 0 and for rotational devices
+ * this is set to MB_DEFAULT_LINEAR_LIMIT.
+ *
* Both the prealloc space are getting populated as above. So for the first
* request we will hit the buddy cache which will result in this prealloc
* space getting filled. The prealloc space is then later used for the
@@ -299,6 +348,8 @@
* - bitlock on a group (group)
* - object (inode/locality) (object)
* - per-pa lock (pa)
+ * - cr0 lists lock (cr0)
+ * - cr1 tree lock (cr1)
*
* Paths:
* - new pa
@@ -328,6 +379,9 @@
* group
* object
*
+ * - allocation path (ext4_mb_regular_allocator)
+ * group
+ * cr0/cr1
*/
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
@@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
ext4_group_t group);
static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
+static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
+ ext4_group_t group, int cr);
+
/*
* The algorithm using this percpu seq counter goes below:
* 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
}
}
+static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
+ int (*cmp)(struct rb_node *, struct rb_node *))
+{
+ struct rb_node **iter = &root->rb_node, *parent = NULL;
+
+ while (*iter) {
+ parent = *iter;
+ if (cmp(new, *iter) > 0)
+ iter = &((*iter)->rb_left);
+ else
+ iter = &((*iter)->rb_right);
+ }
+
+ rb_link_node(new, parent, iter);
+ rb_insert_color(new, root);
+}
+
+static int
+ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
+{
+ struct ext4_group_info *grp1 = rb_entry(rb1,
+ struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ struct ext4_group_info *grp2 = rb_entry(rb2,
+ struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ int num_frags_1, num_frags_2;
+
+ num_frags_1 = grp1->bb_fragments ?
+ grp1->bb_free / grp1->bb_fragments : 0;
+ num_frags_2 = grp2->bb_fragments ?
+ grp2->bb_free / grp2->bb_fragments : 0;
+
+ return (num_frags_2 - num_frags_1);
+}
+
+/*
+ * Reinsert grpinfo into the avg_fragment_size tree with new average
+ * fragment size.
+ */
+static void
+mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
+ return;
+
+ write_lock(&sbi->s_mb_rb_lock);
+ if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
+ rb_erase(&grp->bb_avg_fragment_size_rb,
+ &sbi->s_mb_avg_fragment_size_root);
+ RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
+ }
+
+ ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
+ &grp->bb_avg_fragment_size_rb,
+ ext4_mb_avg_fragment_size_cmp);
+ write_unlock(&sbi->s_mb_rb_lock);
+}
+
+/*
+ * Choose next group by traversing largest_free_order lists. Return 0 if next
+ * group was selected optimally. Return 1 if next group was not selected
+ * optimally. Updates *new_cr if cr level needs an update.
+ */
+static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
+ int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_group_info *iter, *grp;
+ int i;
+
+ if (ac->ac_status == AC_STATUS_FOUND)
+ return 1;
+
+ if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
+ atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
+
+ grp = NULL;
+ for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+ if (list_empty(&sbi->s_mb_largest_free_orders[i]))
+ continue;
+ read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
+ if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
+ read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+ continue;
+ }
+ grp = NULL;
+ list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
+ bb_largest_free_order_node) {
+ if (sbi->s_mb_stats)
+ atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
+ if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
+ grp = iter;
+ break;
+ }
+ }
+ read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+ if (grp)
+ break;
+ }
+
+ if (!grp) {
+ /* Increment cr and search again */
+ *new_cr = 1;
+ } else {
+ *group = grp->bb_group;
+ ac->ac_last_optimal_group = *group;
+ ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
+ }
+ return 0;
+}
+
+/*
+ * Choose next group by traversing average fragment size tree. Return 0 if next
+ * group was selected optimally. Return 1 if next group could not selected
+ * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
+ * update.
+ */
+static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
+ int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ int avg_fragment_size, best_so_far;
+ struct rb_node *node, *found;
+ struct ext4_group_info *grp;
+
+ /*
+ * If there is contention on the lock, instead of waiting for the lock
+ * to become available, just continue searching lineraly. We'll resume
+ * our rb tree search later starting at ac->ac_last_optimal_group.
+ */
+ if (!read_trylock(&sbi->s_mb_rb_lock))
+ return 1;
+
+ if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
+ if (sbi->s_mb_stats)
+ atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
+ /* We have found something at CR 1 in the past */
+ grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
+ for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
+ found = rb_next(found)) {
+ grp = rb_entry(found, struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ if (sbi->s_mb_stats)
+ atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
+ if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
+ break;
+ }
+
+ goto done;
+ }
+
+ node = sbi->s_mb_avg_fragment_size_root.rb_node;
+ best_so_far = 0;
+ found = NULL;
+
+ while (node) {
+ grp = rb_entry(node, struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ avg_fragment_size = 0;
+ /*
+ * Perform this check without locking, we'll lock later to confirm.
+ */
+ if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
+ avg_fragment_size = grp->bb_fragments ?
+ grp->bb_free / grp->bb_fragments : 0;
+ if (!best_so_far || avg_fragment_size < best_so_far) {
+ best_so_far = avg_fragment_size;
+ found = node;
+ }
+ }
+ if (avg_fragment_size > ac->ac_g_ex.fe_len)
+ node = node->rb_right;
+ else
+ node = node->rb_left;
+ }
+
+done:
+ if (found) {
+ grp = rb_entry(found, struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ *group = grp->bb_group;
+ ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
+ } else {
+ *new_cr = 2;
+ }
+
+ read_unlock(&sbi->s_mb_rb_lock);
+ ac->ac_last_optimal_group = *group;
+ return 0;
+}
+
+/*
+ * ext4_mb_choose_next_group: choose next group for allocation.
+ *
+ * @ac Allocation Context
+ * @new_cr This is an output parameter. If the there is no good group
+ * available at current CR level, this field is updated to indicate
+ * the new cr level that should be used.
+ * @group This is an input / output parameter. As an input it indicates the
+ * last group used for allocation. As output, this field indicates
+ * the next group that should be used.
+ * @ngroups Total number of groups
+ */
+static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
+ int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+ int ret;
+
+ *new_cr = ac->ac_criteria;
+
+ if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
+ *new_cr >= 2 ||
+ !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
+ goto inc_and_return;
+
+ if (ac->ac_groups_linear_remaining) {
+ ac->ac_groups_linear_remaining--;
+ goto inc_and_return;
+ }
+
+ if (*new_cr == 0) {
+ ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
+ if (ret)
+ goto inc_and_return;
+ }
+ if (*new_cr == 1) {
+ ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
+ if (ret)
+ goto inc_and_return;
+ }
+ return;
+
+inc_and_return:
+ /*
+ * Artificially restricted ngroups for non-extent
+ * files makes group > ngroups possible on first loop.
+ */
+ *group = *group + 1;
+ if (*group >= ngroups)
+ *group = 0;
+}
+
/*
* Cache the order of the largest free extent we have available in this block
* group.
@@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
static void
mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
int i;
- int bits;
+ if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+ write_lock(&sbi->s_mb_largest_free_orders_locks[
+ grp->bb_largest_free_order]);
+ list_del_init(&grp->bb_largest_free_order_node);
+ write_unlock(&sbi->s_mb_largest_free_orders_locks[
+ grp->bb_largest_free_order]);
+ }
grp->bb_largest_free_order = -1; /* uninit */
- bits = MB_NUM_ORDERS(sb) - 1;
- for (i = bits; i >= 0; i--) {
+ for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
if (grp->bb_counters[i] > 0) {
grp->bb_largest_free_order = i;
break;
}
}
+ if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
+ grp->bb_largest_free_order >= 0 && grp->bb_free) {
+ write_lock(&sbi->s_mb_largest_free_orders_locks[
+ grp->bb_largest_free_order]);
+ list_add_tail(&grp->bb_largest_free_order_node,
+ &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
+ write_unlock(&sbi->s_mb_largest_free_orders_locks[
+ grp->bb_largest_free_order]);
+ }
}
static noinline_for_stack
@@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
period = get_cycles() - period;
atomic_inc(&sbi->s_mb_buddies_generated);
atomic64_add(period, &sbi->s_mb_generation_time);
+ mb_update_avg_fragment_size(sb, grp);
}
/* The buddy information is attached the buddy cache inode
@@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
done:
mb_set_largest_free_order(sb, e4b->bd_info);
+ mb_update_avg_fragment_size(sb, e4b->bd_info);
mb_check_buddy(e4b);
}
@@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
}
mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
+ mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
mb_check_buddy(e4b);
@@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
* from the goal value specified
*/
group = ac->ac_g_ex.fe_group;
+ ac->ac_last_optimal_group = group;
+ ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
prefetch_grp = group;
- for (i = 0; i < ngroups; group++, i++) {
- int ret = 0;
+ for (i = 0; i < ngroups; i++) {
+ int ret = 0, new_cr;
+
cond_resched();
- /*
- * Artificially restricted ngroups for non-extent
- * files makes group > ngroups possible on first loop.
- */
- if (group >= ngroups)
- group = 0;
+
+ ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
+
+ if (new_cr != cr) {
+ cr = new_cr;
+ goto repeat;
+ }
/*
* Batch reads of the block allocation bitmaps
@@ -2578,6 +2902,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
atomic64_read(&sbi->s_bal_cX_failed[0]));
+ seq_printf(seq, "\t\tbad_suggestions: %u\n",
+ atomic_read(&sbi->s_bal_cr0_bad_suggestions));
seq_puts(seq, "\tcr1_stats:\n");
seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
@@ -2585,6 +2911,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
seq_printf(seq, "\t\tuseless_loops: %llu\n",
atomic64_read(&sbi->s_bal_cX_failed[1]));
+ seq_printf(seq, "\t\tbad_suggestions: %u\n",
+ atomic_read(&sbi->s_bal_cr1_bad_suggestions));
seq_puts(seq, "\tcr2_stats:\n");
seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
@@ -2719,7 +3047,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
init_rwsem(&meta_group_info[i]->alloc_sem);
meta_group_info[i]->bb_free_root = RB_ROOT;
+ INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
+ RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
+ meta_group_info[i]->bb_group = group;
mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
return 0;
@@ -2909,6 +3240,26 @@ int ext4_mb_init(struct super_block *sb)
i++;
} while (i < MB_NUM_ORDERS(sb));
+ sbi->s_mb_avg_fragment_size_root = RB_ROOT;
+ sbi->s_mb_largest_free_orders =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+ GFP_KERNEL);
+ if (!sbi->s_mb_largest_free_orders) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ sbi->s_mb_largest_free_orders_locks =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+ GFP_KERNEL);
+ if (!sbi->s_mb_largest_free_orders_locks) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+ INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
+ rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
+ }
+ rwlock_init(&sbi->s_mb_rb_lock);
spin_lock_init(&sbi->s_md_lock);
sbi->s_mb_free_pending = 0;
@@ -2961,6 +3312,10 @@ int ext4_mb_init(struct super_block *sb)
spin_lock_init(&lg->lg_prealloc_lock);
}
+ if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
+ sbi->s_mb_linear_limit = 0;
+ else
+ sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
/* init file for buddy data */
ret = ext4_mb_init_backend(sb);
if (ret != 0)
@@ -2972,6 +3327,8 @@ int ext4_mb_init(struct super_block *sb)
free_percpu(sbi->s_locality_groups);
sbi->s_locality_groups = NULL;
out:
+ kfree(sbi->s_mb_largest_free_orders);
+ kfree(sbi->s_mb_largest_free_orders_locks);
kfree(sbi->s_mb_offsets);
sbi->s_mb_offsets = NULL;
kfree(sbi->s_mb_maxs);
@@ -3028,6 +3385,8 @@ int ext4_mb_release(struct super_block *sb)
kvfree(group_info);
rcu_read_unlock();
}
+ kfree(sbi->s_mb_largest_free_orders);
+ kfree(sbi->s_mb_largest_free_orders_locks);
kfree(sbi->s_mb_offsets);
kfree(sbi->s_mb_maxs);
iput(sbi->s_buddy_cache);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 68111a10cfee..02585e3cbcad 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -78,6 +78,18 @@
*/
#define MB_DEFAULT_MAX_INODE_PREALLOC 512
+/*
+ * Number of groups to search linearly before performing group scanning
+ * optimization.
+ */
+#define MB_DEFAULT_LINEAR_LIMIT 4
+
+/*
+ * Minimum number of groups that should be present in the file system to perform
+ * group scanning optimizations.
+ */
+#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
+
/*
* Number of valid buddy orders
*/
@@ -166,11 +178,14 @@ struct ext4_allocation_context {
/* copy of the best found extent taken before preallocation efforts */
struct ext4_free_extent ac_f_ex;
+ ext4_group_t ac_last_optimal_group;
+ __u32 ac_groups_considered;
+ __u32 ac_flags; /* allocation hints */
__u16 ac_groups_scanned;
+ __u16 ac_groups_linear_remaining;
__u16 ac_found;
__u16 ac_tail;
__u16 ac_buddy;
- __u16 ac_flags; /* allocation hints */
__u8 ac_status;
__u8 ac_criteria;
__u8 ac_2order; /* if request is to allocate 2^N blocks and
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3a2cd9fb7e73..6116640081c0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1687,7 +1687,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
- Opt_prefetch_block_bitmaps,
+ Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
#ifdef CONFIG_EXT4_DEBUG
Opt_fc_debug_max_replay, Opt_fc_debug_force
#endif
@@ -1788,6 +1788,7 @@ static const match_table_t tokens = {
{Opt_nombcache, "nombcache"},
{Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
{Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
+ {Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
{Opt_removed, "nocheck"}, /* mount option from ext2/3 */
{Opt_removed, "reservation"}, /* mount option from ext2/3 */
@@ -1820,6 +1821,8 @@ static ext4_fsblk_t get_sb_block(void **data)
}
#define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
+#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
+
static const char deprecated_msg[] =
"Mount option \"%s\" will be removed by %s\n"
"Contact [email protected] if you think we should keep it.\n";
@@ -2008,6 +2011,7 @@ static const struct mount_opts {
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
MOPT_SET},
+ {Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
#ifdef CONFIG_EXT4_DEBUG
{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
@@ -2092,6 +2096,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
struct ext4_parsed_options {
unsigned long journal_devnum;
unsigned int journal_ioprio;
+ int mb_optimize_scan;
};
static int handle_mount_opt(struct super_block *sb, char *opt, int token,
@@ -2388,6 +2393,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_mount_opt |= m->mount_opt;
} else if (token == Opt_data_err_ignore) {
sbi->s_mount_opt &= ~m->mount_opt;
+ } else if (token == Opt_mb_optimize_scan) {
+ if (arg != 0 && arg != 1) {
+ ext4_msg(sb, KERN_WARNING,
+ "mb_optimize_scan should be set to 0 or 1.");
+ return -1;
+ }
+ parsed_opts->mb_optimize_scan = arg;
} else {
if (!args->from)
arg = 1;
@@ -4034,6 +4046,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
/* Set defaults for the variables that will be set during parsing */
parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
parsed_opts.journal_devnum = 0;
+ parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
if ((data && !orig_data) || !sbi)
goto out_free_base;
@@ -4984,6 +4997,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
ext4_fc_replay_cleanup(sb);
ext4_ext_init(sb);
+
+ /*
+ * Enable optimize_scan if number of groups is > threshold. This can be
+ * turned off by passing "mb_optimize_scan=0". This can also be
+ * turned on forcefully by passing "mb_optimize_scan=1".
+ */
+ if (parsed_opts.mb_optimize_scan == 1)
+ set_opt2(sb, MB_OPTIMIZE_SCAN);
+ else if (parsed_opts.mb_optimize_scan == 0)
+ clear_opt2(sb, MB_OPTIMIZE_SCAN);
+ else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
+ set_opt2(sb, MB_OPTIMIZE_SCAN);
+
err = ext4_mb_init(sb);
if (err) {
ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 59ca9d73b42f..16b8a838f631 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -213,6 +213,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
+EXT4_RW_ATTR_SBI_UI(mb_linear_limit, s_mb_linear_limit);
EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
@@ -260,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
ATTR_LIST(mb_stream_req),
ATTR_LIST(mb_group_prealloc),
ATTR_LIST(mb_max_inode_prealloc),
+ ATTR_LIST(mb_linear_limit),
ATTR_LIST(max_writeback_mb_bump),
ATTR_LIST(extent_max_zeroout_kb),
ATTR_LIST(trigger_fs_error),
--
2.31.0.291.g576ba9dcdaf-goog
A few arrays in mballoc.c use the total number of valid orders as
their size. Currently, this value is set as "sb->s_blocksize_bits +
2". This makes code harder to read. So, instead add a new macro
MB_NUM_ORDERS(sb) to make the code more readable.
Signed-off-by: Harshad Shirwadkar <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
---
fs/ext4/mballoc.c | 19 ++++++++++---------
fs/ext4/mballoc.h | 5 +++++
2 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a4b71c9c1e66..15127d815461 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -756,7 +756,7 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
grp->bb_largest_free_order = -1; /* uninit */
- bits = sb->s_blocksize_bits + 1;
+ bits = MB_NUM_ORDERS(sb) - 1;
for (i = bits; i >= 0; i--) {
if (grp->bb_counters[i] > 0) {
grp->bb_largest_free_order = i;
@@ -957,7 +957,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
grinfo->bb_fragments = 0;
memset(grinfo->bb_counters, 0,
sizeof(*grinfo->bb_counters) *
- (sb->s_blocksize_bits+2));
+ (MB_NUM_ORDERS(sb)));
/*
* incore got set to the group block bitmap below
*/
@@ -1928,7 +1928,7 @@ void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
int max;
BUG_ON(ac->ac_2order <= 0);
- for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) {
+ for (i = ac->ac_2order; i < MB_NUM_ORDERS(sb); i++) {
if (grp->bb_counters[i] == 0)
continue;
@@ -2107,7 +2107,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
if (free < ac->ac_g_ex.fe_len)
return false;
- if (ac->ac_2order > ac->ac_sb->s_blocksize_bits+1)
+ if (ac->ac_2order >= MB_NUM_ORDERS(ac->ac_sb))
return true;
if (grp->bb_largest_free_order < ac->ac_2order)
@@ -2315,13 +2315,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
* We also support searching for power-of-two requests only for
* requests upto maximum buddy size we have constructed.
*/
- if (i >= sbi->s_mb_order2_reqs && i <= sb->s_blocksize_bits + 2) {
+ if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
/*
* This should tell if fe_len is exactly power of 2
*/
if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0)
ac->ac_2order = array_index_nospec(i - 1,
- sb->s_blocksize_bits + 2);
+ MB_NUM_ORDERS(sb));
}
/* if stream allocation is enabled, use global goal */
@@ -2873,7 +2873,7 @@ int ext4_mb_init(struct super_block *sb)
unsigned max;
int ret;
- i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_offsets);
+ i = MB_NUM_ORDERS(sb) * sizeof(*sbi->s_mb_offsets);
sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL);
if (sbi->s_mb_offsets == NULL) {
@@ -2881,7 +2881,7 @@ int ext4_mb_init(struct super_block *sb)
goto out;
}
- i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_maxs);
+ i = MB_NUM_ORDERS(sb) * sizeof(*sbi->s_mb_maxs);
sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL);
if (sbi->s_mb_maxs == NULL) {
ret = -ENOMEM;
@@ -2907,7 +2907,8 @@ int ext4_mb_init(struct super_block *sb)
offset_incr = offset_incr >> 1;
max = max >> 1;
i++;
- } while (i <= sb->s_blocksize_bits + 1);
+ } while (i < MB_NUM_ORDERS(sb));
+
spin_lock_init(&sbi->s_md_lock);
sbi->s_mb_free_pending = 0;
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index e75b4749aa1c..68111a10cfee 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -78,6 +78,11 @@
*/
#define MB_DEFAULT_MAX_INODE_PREALLOC 512
+/*
+ * Number of valid buddy orders
+ */
+#define MB_NUM_ORDERS(sb) ((sb)->s_blocksize_bits + 2)
+
struct ext4_free_data {
/* this links the free block information from sb_info */
struct list_head efd_list;
--
2.31.0.291.g576ba9dcdaf-goog
s_mb_buddies_generated gets used later in this patch series to
determine if the cr 0 and cr 1 optimziations should be performed or
not. Currently, s_mb_buddies_generated is protected under a
spin_lock. In the allocation path, it is better if we don't depend on
the lock and instead read the value atomically. In order to do that,
we drop s_bal_lock altogether and we convert the only two protected
fields by it s_mb_buddies_generated and s_mb_generation_time to atomic
type.
Signed-off-by: Harshad Shirwadkar <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
---
fs/ext4/ext4.h | 5 ++---
fs/ext4/mballoc.c | 13 +++++--------
2 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2866d249f3d2..cb0724b87d54 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1552,9 +1552,8 @@ struct ext4_sb_info {
atomic_t s_bal_goals; /* goal hits */
atomic_t s_bal_breaks; /* too long searches */
atomic_t s_bal_2orders; /* 2^order hits */
- spinlock_t s_bal_lock;
- unsigned long s_mb_buddies_generated;
- unsigned long long s_mb_generation_time;
+ atomic_t s_mb_buddies_generated; /* number of buddies generated */
+ atomic64_t s_mb_generation_time;
atomic_t s_mb_lost_chunks;
atomic_t s_mb_preallocated;
atomic_t s_mb_discarded;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 99bf091fee10..07b78a3cc421 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -816,10 +816,8 @@ void ext4_mb_generate_buddy(struct super_block *sb,
clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
period = get_cycles() - period;
- spin_lock(&sbi->s_bal_lock);
- sbi->s_mb_buddies_generated++;
- sbi->s_mb_generation_time += period;
- spin_unlock(&sbi->s_bal_lock);
+ atomic_inc(&sbi->s_mb_buddies_generated);
+ atomic64_add(period, &sbi->s_mb_generation_time);
}
/* The buddy information is attached the buddy cache inode
@@ -2843,7 +2841,6 @@ int ext4_mb_init(struct super_block *sb)
} while (i <= sb->s_blocksize_bits + 1);
spin_lock_init(&sbi->s_md_lock);
- spin_lock_init(&sbi->s_bal_lock);
sbi->s_mb_free_pending = 0;
INIT_LIST_HEAD(&sbi->s_freed_data_list);
@@ -2979,9 +2976,9 @@ int ext4_mb_release(struct super_block *sb)
atomic_read(&sbi->s_bal_breaks),
atomic_read(&sbi->s_mb_lost_chunks));
ext4_msg(sb, KERN_INFO,
- "mballoc: %lu generated and it took %Lu",
- sbi->s_mb_buddies_generated,
- sbi->s_mb_generation_time);
+ "mballoc: %u generated and it took %llu",
+ atomic_read(&sbi->s_mb_buddies_generated),
+ atomic64_read(&sbi->s_mb_generation_time));
ext4_msg(sb, KERN_INFO,
"mballoc: %u preallocated, %u discarded",
atomic_read(&sbi->s_mb_preallocated),
--
2.31.0.291.g576ba9dcdaf-goog
Add new stats for measuring the performance of mballoc. This patch is
forked from Artem Blagodarenko's work that can be found here:
https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch
This patch reorganizes the stats by cr level. This is how the output
looks like:
mballoc:
reqs: 0
success: 0
groups_scanned: 0
cr0_stats:
hits: 0
groups_considered: 0
useless_loops: 0
bad_suggestions: 0
cr1_stats:
hits: 0
groups_considered: 0
useless_loops: 0
bad_suggestions: 0
cr2_stats:
hits: 0
groups_considered: 0
useless_loops: 0
cr3_stats:
hits: 0
groups_considered: 0
useless_loops: 0
extents_scanned: 0
goal_hits: 0
2^n_hits: 0
breaks: 0
lost: 0
buddies_generated: 0/40
buddies_time_used: 0
preallocated: 0
discarded: 0
Signed-off-by: Harshad Shirwadkar <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
---
fs/ext4/ext4.h | 5 ++++
fs/ext4/mballoc.c | 75 +++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/sysfs.c | 2 ++
3 files changed, 80 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cb0724b87d54..85eeeba3bca3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1549,9 +1549,13 @@ struct ext4_sb_info {
atomic_t s_bal_success; /* we found long enough chunks */
atomic_t s_bal_allocated; /* in blocks */
atomic_t s_bal_ex_scanned; /* total extents scanned */
+ atomic_t s_bal_groups_scanned; /* number of groups scanned */
atomic_t s_bal_goals; /* goal hits */
atomic_t s_bal_breaks; /* too long searches */
atomic_t s_bal_2orders; /* 2^order hits */
+ atomic64_t s_bal_cX_groups_considered[4];
+ atomic64_t s_bal_cX_hits[4];
+ atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
atomic_t s_mb_buddies_generated; /* number of buddies generated */
atomic64_t s_mb_generation_time;
atomic_t s_mb_lost_chunks;
@@ -2808,6 +2812,7 @@ int __init ext4_fc_init_dentry_cache(void);
extern const struct seq_operations ext4_mb_seq_groups_ops;
extern long ext4_mb_stats;
extern long ext4_mb_max_to_scan;
+extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
extern int ext4_mb_init(struct super_block *);
extern int ext4_mb_release(struct super_block *);
extern ext4_fsblk_t ext4_mb_new_blocks(handle_t *,
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 07b78a3cc421..a4b71c9c1e66 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2146,6 +2146,8 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
ext4_grpblk_t free;
int ret = 0;
+ if (sbi->s_mb_stats)
+ atomic64_inc(&sbi->s_bal_cX_groups_considered[ac->ac_criteria]);
if (should_lock)
ext4_lock_group(sb, group);
free = grp->bb_free;
@@ -2420,6 +2422,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
if (ac->ac_status != AC_STATUS_CONTINUE)
break;
}
+ /* Processed all groups and haven't found blocks */
+ if (sbi->s_mb_stats && i == ngroups)
+ atomic64_inc(&sbi->s_bal_cX_failed[cr]);
}
if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -2449,6 +2454,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
goto repeat;
}
}
+
+ if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND)
+ atomic64_inc(&sbi->s_bal_cX_hits[ac->ac_criteria]);
out:
if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
err = first_err;
@@ -2548,6 +2556,67 @@ const struct seq_operations ext4_mb_seq_groups_ops = {
.show = ext4_mb_seq_groups_show,
};
+int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
+{
+ struct super_block *sb = (struct super_block *)seq->private;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ seq_puts(seq, "mballoc:\n");
+ if (!sbi->s_mb_stats) {
+ seq_puts(seq, "\tmb stats collection turned off.\n");
+ seq_puts(seq, "\tTo enable, please write \"1\" to sysfs file mb_stats.\n");
+ return 0;
+ }
+ seq_printf(seq, "\treqs: %u\n", atomic_read(&sbi->s_bal_reqs));
+ seq_printf(seq, "\tsuccess: %u\n", atomic_read(&sbi->s_bal_success));
+
+ seq_printf(seq, "\tgroups_scanned: %u\n", atomic_read(&sbi->s_bal_groups_scanned));
+
+ seq_puts(seq, "\tcr0_stats:\n");
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[0]));
+ seq_printf(seq, "\t\tgroups_considered: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
+ seq_printf(seq, "\t\tuseless_loops: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_failed[0]));
+
+ seq_puts(seq, "\tcr1_stats:\n");
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
+ seq_printf(seq, "\t\tgroups_considered: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
+ seq_printf(seq, "\t\tuseless_loops: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_failed[1]));
+
+ seq_puts(seq, "\tcr2_stats:\n");
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
+ seq_printf(seq, "\t\tgroups_considered: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_groups_considered[2]));
+ seq_printf(seq, "\t\tuseless_loops: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_failed[2]));
+
+ seq_puts(seq, "\tcr3_stats:\n");
+ seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[3]));
+ seq_printf(seq, "\t\tgroups_considered: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_groups_considered[3]));
+ seq_printf(seq, "\t\tuseless_loops: %llu\n",
+ atomic64_read(&sbi->s_bal_cX_failed[3]));
+ seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
+ seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
+ seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
+ seq_printf(seq, "\t\tbreaks: %u\n", atomic_read(&sbi->s_bal_breaks));
+ seq_printf(seq, "\t\tlost: %u\n", atomic_read(&sbi->s_mb_lost_chunks));
+
+ seq_printf(seq, "\tbuddies_generated: %u/%u\n",
+ atomic_read(&sbi->s_mb_buddies_generated),
+ ext4_get_groups_count(sb));
+ seq_printf(seq, "\tbuddies_time_used: %llu\n",
+ atomic64_read(&sbi->s_mb_generation_time));
+ seq_printf(seq, "\tpreallocated: %u\n",
+ atomic_read(&sbi->s_mb_preallocated));
+ seq_printf(seq, "\tdiscarded: %u\n",
+ atomic_read(&sbi->s_mb_discarded));
+ return 0;
+}
+
static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
{
int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
@@ -2968,9 +3037,10 @@ int ext4_mb_release(struct super_block *sb)
atomic_read(&sbi->s_bal_reqs),
atomic_read(&sbi->s_bal_success));
ext4_msg(sb, KERN_INFO,
- "mballoc: %u extents scanned, %u goal hits, "
+ "mballoc: %u extents scanned, %u groups scanned, %u goal hits, "
"%u 2^N hits, %u breaks, %u lost",
atomic_read(&sbi->s_bal_ex_scanned),
+ atomic_read(&sbi->s_bal_groups_scanned),
atomic_read(&sbi->s_bal_goals),
atomic_read(&sbi->s_bal_2orders),
atomic_read(&sbi->s_bal_breaks),
@@ -3573,12 +3643,13 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
{
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
- if (sbi->s_mb_stats && ac->ac_g_ex.fe_len > 1) {
+ if (sbi->s_mb_stats && ac->ac_g_ex.fe_len >= 1) {
atomic_inc(&sbi->s_bal_reqs);
atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
atomic_inc(&sbi->s_bal_success);
atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
+ atomic_add(ac->ac_groups_scanned, &sbi->s_bal_groups_scanned);
if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
atomic_inc(&sbi->s_bal_goals);
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 075aa3a19ff5..59ca9d73b42f 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -521,6 +521,8 @@ int ext4_register_sysfs(struct super_block *sb)
ext4_fc_info_show, sb);
proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
&ext4_mb_seq_groups_ops, sb);
+ proc_create_single_data("mb_stats", 0444, sbi->s_proc,
+ ext4_seq_mb_stats_show, sb);
}
return 0;
}
--
2.31.0.291.g576ba9dcdaf-goog
This patch adds a new file "mb_structs_summary" which allows us to see
the summary of the new allocator structures added in this
series. Here's the sample output of file:
optimize_scan: 1
max_free_order_lists:
list_order_0_groups: 0
list_order_1_groups: 0
list_order_2_groups: 0
list_order_3_groups: 0
list_order_4_groups: 0
list_order_5_groups: 0
list_order_6_groups: 0
list_order_7_groups: 0
list_order_8_groups: 0
list_order_9_groups: 0
list_order_10_groups: 0
list_order_11_groups: 0
list_order_12_groups: 0
list_order_13_groups: 40
fragment_size_tree:
tree_min: 16384
tree_max: 32768
tree_nodes: 40
Signed-off-by: Harshad Shirwadkar <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/mballoc.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/sysfs.c | 2 ++
3 files changed, 89 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5930c8cb5159..f6a36a0e07c1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2822,6 +2822,7 @@ int __init ext4_fc_init_dentry_cache(void);
/* mballoc.c */
extern const struct seq_operations ext4_mb_seq_groups_ops;
+extern const struct seq_operations ext4_mb_seq_structs_summary_ops;
extern long ext4_mb_stats;
extern long ext4_mb_max_to_scan;
extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index cbf9a89c0ef5..0cd428bf2597 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2945,6 +2945,92 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
return 0;
}
+static void *ext4_mb_seq_structs_summary_start(struct seq_file *seq, loff_t *pos)
+{
+ struct super_block *sb = PDE_DATA(file_inode(seq->file));
+ unsigned long position;
+
+ read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
+
+ if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+ return NULL;
+ position = *pos + 1;
+ return (void *) ((unsigned long) position);
+}
+
+static void *ext4_mb_seq_structs_summary_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct super_block *sb = PDE_DATA(file_inode(seq->file));
+ unsigned long position;
+
+ ++*pos;
+ if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+ return NULL;
+ position = *pos + 1;
+ return (void *) ((unsigned long) position);
+}
+
+static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
+{
+ struct super_block *sb = PDE_DATA(file_inode(seq->file));
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ unsigned long position = ((unsigned long) v);
+ struct ext4_group_info *grp;
+ struct rb_node *n;
+ unsigned int count, min, max;
+
+ position--;
+ if (position >= MB_NUM_ORDERS(sb)) {
+ seq_puts(seq, "fragment_size_tree:\n");
+ n = rb_first(&sbi->s_mb_avg_fragment_size_root);
+ if (!n) {
+ seq_puts(seq, "\ttree_min: 0\n\ttree_max: 0\n\ttree_nodes: 0\n");
+ return 0;
+ }
+ grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
+ min = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
+ count = 1;
+ while (rb_next(n)) {
+ count++;
+ n = rb_next(n);
+ }
+ grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
+ max = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
+
+ seq_printf(seq, "\ttree_min: %u\n\ttree_max: %u\n\ttree_nodes: %u\n",
+ min, max, count);
+ return 0;
+ }
+
+ if (position == 0) {
+ seq_printf(seq, "optimize_scan: %d\n",
+ test_opt2(sb, MB_OPTIMIZE_SCAN) ? 1 : 0);
+ seq_puts(seq, "max_free_order_lists:\n");
+ }
+ count = 0;
+ list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position],
+ bb_largest_free_order_node)
+ count++;
+ seq_printf(seq, "\tlist_order_%u_groups: %u\n",
+ (unsigned int)position, count);
+
+ return 0;
+}
+
+static void ext4_mb_seq_structs_summary_stop(struct seq_file *seq, void *v)
+{
+ struct super_block *sb = PDE_DATA(file_inode(seq->file));
+
+ read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
+}
+
+const struct seq_operations ext4_mb_seq_structs_summary_ops = {
+ .start = ext4_mb_seq_structs_summary_start,
+ .next = ext4_mb_seq_structs_summary_next,
+ .stop = ext4_mb_seq_structs_summary_stop,
+ .show = ext4_mb_seq_structs_summary_show,
+};
+
static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
{
int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 16b8a838f631..4a3b78684f83 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -525,6 +525,8 @@ int ext4_register_sysfs(struct super_block *sb)
&ext4_mb_seq_groups_ops, sb);
proc_create_single_data("mb_stats", 0444, sbi->s_proc,
ext4_seq_mb_stats_show, sb);
+ proc_create_seq_data("mb_structs_summary", 0444, sbi->s_proc,
+ &ext4_mb_seq_structs_summary_ops, sb);
}
return 0;
}
--
2.31.0.291.g576ba9dcdaf-goog
On 3/25/21 4:49 AM, Harshad Shirwadkar wrote:
> Instead of traversing through groups linearly, scan groups in specific
> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> largest free order >= the order of the request. So, with this patch,
> we maintain lists for each possible order and insert each group into a
> list based on the largest free order in its buddy bitmap. During cr 0
> allocation, we traverse these lists in the increasing order of largest
> free orders. This allows us to find a group with the best available cr
> 0 match in constant time. If nothing can be found, we fallback to cr 1
> immediately.
>
> At CR1, the story is slightly different. We want to traverse in the
> order of increasing average fragment size. For CR1, we maintain a rb
> tree of groupinfos which is sorted by average fragment size. Instead
> of traversing linearly, at CR1, we traverse in the order of increasing
> average fragment size, starting at the most optimal group. This brings
> down cr 1 search complexity to log(num groups).
>
> For cr >= 2, we just perform the linear search as before. Also, in
> case of lock contention, we intermittently fallback to linear search
> even in CR 0 and CR 1 cases. This allows us to proceed during the
> allocation path even in case of high contention.
>
> There is an opportunity to do optimization at CR2 too. That's because
> at CR2 we only consider groups where bb_free counter (number of free
> blocks) is greater than the request extent size. That's left as future
> work.
>
> All the changes introduced in this patch are protected under a new
> mount option "mb_optimize_scan".
>
> With this patchset, following experiment was performed:
>
> Created a highly fragmented disk of size 65TB. The disk had no
> contiguous 2M regions. Following command was run consecutively for 3
> times:
>
> time dd if=/dev/urandom of=file bs=2M count=10
>
> Here are the results with and without cr 0/1 optimizations introduced
> in this patch:
>
> |---------+------------------------------+---------------------------|
> | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> |---------+------------------------------+---------------------------|
> | 1st run | 5m1.871s | 2m47.642s |
> | 2nd run | 2m28.390s | 0m0.611s |
> | 3rd run | 2m26.530s | 0m1.255s |
> |---------+------------------------------+---------------------------|
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> Reported-by: Dan Carpenter <[email protected]>
> Reviewed-by: Andreas Dilger <[email protected]>
> ---
> fs/ext4/ext4.h | 19 ++-
> fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
> fs/ext4/mballoc.h | 17 ++-
> fs/ext4/super.c | 28 +++-
> fs/ext4/sysfs.c | 2 +
> 5 files changed, 432 insertions(+), 15 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 85eeeba3bca3..5930c8cb5159 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
> #define EXT4_MB_USE_RESERVED 0x2000
> /* Do strict check for free blocks while retrying block allocation */
> #define EXT4_MB_STRICT_CHECK 0x4000
> -
> +/* Large fragment size list lookup succeeded at least once for cr = 0 */
> +#define EXT4_MB_CR0_OPTIMIZED 0x8000
> +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
> +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
> struct ext4_allocation_request {
> /* target inode for block we're allocating */
> struct inode *inode;
> @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
> #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
> #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
> #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
> -
> +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
> + * scanning in mballoc
> + */
>
> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> ~EXT4_MOUNT_##opt
> @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
> unsigned int s_mb_free_pending;
> struct list_head s_freed_data_list; /* List of blocks to be freed
> after commit completed */
> + struct rb_root s_mb_avg_fragment_size_root;
> + rwlock_t s_mb_rb_lock;
> + struct list_head *s_mb_largest_free_orders;
> + rwlock_t *s_mb_largest_free_orders_locks;
>
> /* tunables */
> unsigned long s_stripe;
> + unsigned int s_mb_linear_limit;
> unsigned int s_mb_stream_request;
> unsigned int s_mb_max_to_scan;
> unsigned int s_mb_min_to_scan;
> @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
> atomic_t s_bal_goals; /* goal hits */
> atomic_t s_bal_breaks; /* too long searches */
> atomic_t s_bal_2orders; /* 2^order hits */
> + atomic_t s_bal_cr0_bad_suggestions;
> + atomic_t s_bal_cr1_bad_suggestions;
> atomic64_t s_bal_cX_groups_considered[4];
> atomic64_t s_bal_cX_hits[4];
> atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
> @@ -3309,11 +3321,14 @@ struct ext4_group_info {
> ext4_grpblk_t bb_free; /* total free blocks */
> ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
> + ext4_group_t bb_group; /* Group number */
> struct list_head bb_prealloc_list;
> #ifdef DOUBLE_CHECK
> void *bb_bitmap;
> #endif
> struct rw_semaphore alloc_sem;
> + struct rb_node bb_avg_fragment_size_rb;
> + struct list_head bb_largest_free_order_node;
> ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
> * regions, index is order.
> * bb_counters[3] = 5 means
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 15127d815461..cbf9a89c0ef5 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -127,11 +127,50 @@
> * smallest multiple of the stripe value (sbi->s_stripe) which is
> * greater than the default mb_group_prealloc.
> *
> + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
> + * structures in two data structures:
> + *
> + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
> + *
> + * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
> + *
> + * This is an array of lists where the index in the array represents the
> + * largest free order in the buddy bitmap of the participating group infos of
> + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
> + * number of buddy bitmap orders possible) number of lists. Group-infos are
> + * placed in appropriate lists.
> + *
> + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
> + *
> + * Locking: sbi->s_mb_rb_lock (rwlock)
> + *
> + * This is a red black tree consisting of group infos and the tree is sorted
> + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
> + * / ext4_group_info->bb_fragments).
> + *
> + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
> + * structures to decide the order in which groups are to be traversed for
> + * fulfilling an allocation request.
> + *
> + * At CR = 0, we look for groups which have the largest_free_order >= the order
> + * of the request. We directly look at the largest free order list in the data
> + * structure (1) above where largest_free_order = order of the request. If that
> + * list is empty, we look at remaining list in the increasing order of
> + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
> + *
> + * At CR = 1, we only consider groups where average fragment size > request
> + * size. So, we lookup a group which has average fragment size just above or
> + * equal to request size using our rb tree (data structure 2) in O(log N) time.
> + *
> + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
> + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
> + *
> * The regular allocator (using the buddy cache) supports a few tunables.
> *
> * /sys/fs/ext4/<partition>/mb_min_to_scan
> * /sys/fs/ext4/<partition>/mb_max_to_scan
> * /sys/fs/ext4/<partition>/mb_order2_req
> + * /sys/fs/ext4/<partition>/mb_linear_limit
> *
> * The regular allocator uses buddy scan only if the request len is power of
> * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
> @@ -149,6 +188,16 @@
> * can be used for allocation. ext4_mb_good_group explains how the groups are
> * checked.
> *
> + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
> + * get traversed linearly. That may result in subsequent allocations being not
> + * close to each other. And so, the underlying device may get filled up in a
> + * non-linear fashion. While that may not matter on non-rotational devices, for
> + * rotational devices that may result in higher seek times. "mb_linear_limit"
> + * tells mballoc how many groups mballoc should search linearly before
> + * performing consulting above data structures for more efficient lookups. For
> + * non rotational devices, this value defaults to 0 and for rotational devices
> + * this is set to MB_DEFAULT_LINEAR_LIMIT.
> + *
> * Both the prealloc space are getting populated as above. So for the first
> * request we will hit the buddy cache which will result in this prealloc
> * space getting filled. The prealloc space is then later used for the
> @@ -299,6 +348,8 @@
> * - bitlock on a group (group)
> * - object (inode/locality) (object)
> * - per-pa lock (pa)
> + * - cr0 lists lock (cr0)
> + * - cr1 tree lock (cr1)
> *
> * Paths:
> * - new pa
> @@ -328,6 +379,9 @@
> * group
> * object
> *
> + * - allocation path (ext4_mb_regular_allocator)
> + * group
> + * cr0/cr1
> */
> static struct kmem_cache *ext4_pspace_cachep;
> static struct kmem_cache *ext4_ac_cachep;
> @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
> ext4_group_t group);
> static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
>
> +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
> + ext4_group_t group, int cr);
> +
> /*
> * The algorithm using this percpu seq counter goes below:
> * 1. We sample the percpu discard_pa_seq counter before trying for block
> @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> }
> }
>
> +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> + int (*cmp)(struct rb_node *, struct rb_node *))
> +{
> + struct rb_node **iter = &root->rb_node, *parent = NULL;
> +
> + while (*iter) {
> + parent = *iter;
> + if (cmp(new, *iter) > 0)
> + iter = &((*iter)->rb_left);
> + else
> + iter = &((*iter)->rb_right);
> + }
> +
> + rb_link_node(new, parent, iter);
> + rb_insert_color(new, root);
> +}
> +
> +static int
> +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> +{
> + struct ext4_group_info *grp1 = rb_entry(rb1,
> + struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + struct ext4_group_info *grp2 = rb_entry(rb2,
> + struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + int num_frags_1, num_frags_2;
> +
> + num_frags_1 = grp1->bb_fragments ?
> + grp1->bb_free / grp1->bb_fragments : 0;
> + num_frags_2 = grp2->bb_fragments ?
> + grp2->bb_free / grp2->bb_fragments : 0;
> +
> + return (num_frags_2 - num_frags_1);
> +}
> +
> +/*
> + * Reinsert grpinfo into the avg_fragment_size tree with new average
> + * fragment size.
> + */
> +static void
> +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> + return;
> +
> + write_lock(&sbi->s_mb_rb_lock);
> + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> + rb_erase(&grp->bb_avg_fragment_size_rb,
> + &sbi->s_mb_avg_fragment_size_root);
> + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> + }
> +
> + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
> + &grp->bb_avg_fragment_size_rb,
> + ext4_mb_avg_fragment_size_cmp);
> + write_unlock(&sbi->s_mb_rb_lock);
> +}
> +
> +/*
> + * Choose next group by traversing largest_free_order lists. Return 0 if next
> + * group was selected optimally. Return 1 if next group was not selected
> + * optimally. Updates *new_cr if cr level needs an update.
> + */
> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_group_info *iter, *grp;
> + int i;
> +
> + if (ac->ac_status == AC_STATUS_FOUND)
> + return 1;
> +
> + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> +
> + grp = NULL;
> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> + continue;
> + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> + continue;
> + }
> + grp = NULL;
> + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> + bb_largest_free_order_node) {
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> + grp = iter;
> + break;
> + }
> + }
> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> + if (grp)
> + break;
> + }
> +
> + if (!grp) {
> + /* Increment cr and search again */
> + *new_cr = 1;
> + } else {
> + *group = grp->bb_group;
> + ac->ac_last_optimal_group = *group;
> + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
> + }
> + return 0;
> +}
> +
> +/*
> + * Choose next group by traversing average fragment size tree. Return 0 if next
> + * group was selected optimally. Return 1 if next group could not selected
> + * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
> + * update.
> + */
> +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + int avg_fragment_size, best_so_far;
> + struct rb_node *node, *found;
> + struct ext4_group_info *grp;
> +
> + /*
> + * If there is contention on the lock, instead of waiting for the lock
> + * to become available, just continue searching lineraly. We'll resume
> + * our rb tree search later starting at ac->ac_last_optimal_group.
> + */
> + if (!read_trylock(&sbi->s_mb_rb_lock))
> + return 1;
> +
> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> + if (sbi->s_mb_stats)
> + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> + /* We have found something at CR 1 in the past */
> + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
> + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
> + found = rb_next(found)) {
> + grp = rb_entry(found, struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
> + break;
> + }
> +
> + goto done;
> + }
> +
> + node = sbi->s_mb_avg_fragment_size_root.rb_node;
> + best_so_far = 0;
> + found = NULL;
> +
> + while (node) {
> + grp = rb_entry(node, struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + avg_fragment_size = 0;
> + /*
> + * Perform this check without locking, we'll lock later to confirm.
> + */
> + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
> + avg_fragment_size = grp->bb_fragments ?
> + grp->bb_free / grp->bb_fragments : 0;
> + if (!best_so_far || avg_fragment_size < best_so_far) {
> + best_so_far = avg_fragment_size;
> + found = node;
> + }
> + }
> + if (avg_fragment_size > ac->ac_g_ex.fe_len)
> + node = node->rb_right;
> + else
> + node = node->rb_left;
> + }
> +
> +done:
> + if (found) {
> + grp = rb_entry(found, struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + *group = grp->bb_group;
> + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> + } else {
> + *new_cr = 2;
> + }
> +
> + read_unlock(&sbi->s_mb_rb_lock);
> + ac->ac_last_optimal_group = *group;
> + return 0;
> +}
> +
> +/*
> + * ext4_mb_choose_next_group: choose next group for allocation.
> + *
> + * @ac Allocation Context
> + * @new_cr This is an output parameter. If the there is no good group
> + * available at current CR level, this field is updated to indicate
> + * the new cr level that should be used.
> + * @group This is an input / output parameter. As an input it indicates the
> + * last group used for allocation. As output, this field indicates
> + * the next group that should be used.
> + * @ngroups Total number of groups
> + */
> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> + int ret;
> +
> + *new_cr = ac->ac_criteria;
> +
> + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> + *new_cr >= 2 ||
> + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> + goto inc_and_return;
> +
> + if (ac->ac_groups_linear_remaining) {
> + ac->ac_groups_linear_remaining--;
> + goto inc_and_return;
> + }
> +
> + if (*new_cr == 0) {
> + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> + if (ret)
> + goto inc_and_return;
> + }
> + if (*new_cr == 1) {
> + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> + if (ret)
> + goto inc_and_return;
> + }
> + return;
> +
> +inc_and_return:
> + /*
> + * Artificially restricted ngroups for non-extent
> + * files makes group > ngroups possible on first loop.
> + */
> + *group = *group + 1;
hmm, Please help me correct if I am missing something here.
So, if *new_cr >= 2, then we will directly come here
increment the group and return to calling function.
But then we ended up incrementing the group while not
even searching in the given group to see if we could
find some blocks.
> + if (*group >= ngroups)
> + *group = 0;
> +}
> +
> /*
> * Cache the order of the largest free extent we have available in this block
> * group.
> @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> static void
> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> {
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> int i;
> - int bits;
>
> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> + write_lock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + list_del_init(&grp->bb_largest_free_order_node);
> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + }
> grp->bb_largest_free_order = -1; /* uninit */
>
> - bits = MB_NUM_ORDERS(sb) - 1;
> - for (i = bits; i >= 0; i--) {
> + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> if (grp->bb_counters[i] > 0) {
> grp->bb_largest_free_order = i;
> break;
> }
> }
> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
> + grp->bb_largest_free_order >= 0 && grp->bb_free) {
> + write_lock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + list_add_tail(&grp->bb_largest_free_order_node,
> + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + }
> }
>
> static noinline_for_stack
> @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
> period = get_cycles() - period;
> atomic_inc(&sbi->s_mb_buddies_generated);
> atomic64_add(period, &sbi->s_mb_generation_time);
> + mb_update_avg_fragment_size(sb, grp);
> }
>
> /* The buddy information is attached the buddy cache inode
> @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
>
> done:
> mb_set_largest_free_order(sb, e4b->bd_info);
> + mb_update_avg_fragment_size(sb, e4b->bd_info);
> mb_check_buddy(e4b);
> }
>
> @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> }
> mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
>
> + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
> ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
> mb_check_buddy(e4b);
>
> @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> * from the goal value specified
> */
> group = ac->ac_g_ex.fe_group;
> + ac->ac_last_optimal_group = group;
Note we start from a goal group or optimal group.
> + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
> prefetch_grp = group;
>
> - for (i = 0; i < ngroups; group++, i++) {
> - int ret = 0;
> + for (i = 0; i < ngroups; i++) {
> + int ret = 0, new_cr;
> +
> cond_resched();
> - /*
> - * Artificially restricted ngroups for non-extent
> - * files makes group > ngroups possible on first loop.
> - */
> - if (group >= ngroups)
> - group = 0;
> +
> + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
> +
> + if (new_cr != cr) {
> + cr = new_cr;
> + goto repeat;
> + }
So ext4_mb_choose_next_group() will do group++ _even_ for cr >=2.
This would mean that we will never start our search from ac_g_ex.fe_group.
Why is that? Did I miss anything?
-ritesh
On 21/03/24 04:19PM, Harshad Shirwadkar wrote:
> This patch series improves cr 0 and cr 1 passes of the allocator
> signficantly. Currently, at cr 0 and 1, we perform linear lookups to
> find the matching groups. That's very inefficient for large file
> systems where there are millions of block groups. At cr 0, we only
> care about the groups that have the largest free order >= the
> request's order and at cr 1 we only care about groups where average
> fragment size > the request size. so, this patchset introduces new
> data structures that allow us to perform cr 0 lookup in constant time
> and cr 1 lookup in log (number of groups) time instead of linear.
>
> For cr 0, we add a list for each order and all the groups are enqueued
> to the appropriate list based on the largest free order in its buddy
> bitmap. This allows us to lookup a match at cr 0 in constant time.
>
> For cr 1, we add a new rb tree of groups sorted by largest fragment
> size. This allows us to lookup a match for cr 1 in log (num groups)
> time.
>
> These optimizations can be enabled by passing "mb_optimize_scan" mount
> option.
>
> These changes may result in allocations to be spread across the block
> device. While that would not matter some block devices (such as flash)
> it may be a cause of concern for other block devices that benefit from
> storing related content togetther such as disk. However, it can be
> argued that in high fragmentation scenrio, especially for large disks,
> it's still worth optimizing the scanning since in such cases, we get
> cpu bound on group scanning instead of getting IO bound. Perhaps, in
> future, we could dynamically turn this new optimization on based on
> fragmentation levels for such devices.
>
> Verified that there are no regressions in smoke tests (-g quick -c 4k).
>
> Also, to demonstrate the effectiveness for the patch series, following
> experiment was performed:
>
> Created a highly fragmented disk of size 65TB. The disk had no
> contiguous 2M regions. Following command was run consecutively for 3
> times:
>
> time dd if=/dev/urandom of=file bs=2M count=10
>
> Here are the results with and without cr 0/1 optimizations:
>
> |---------+------------------------------+---------------------------|
> | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> |---------+------------------------------+---------------------------|
> | 1st run | 5m1.871s | 2m47.642s |
> | 2nd run | 2m28.390s | 0m0.611s |
> | 3rd run | 2m26.530s | 0m1.255s |
> |---------+------------------------------+---------------------------|
>
> The patch [3/6] "ext4: add mballoc stats proc file" is a modified
> version of the patch originally written by Artem Blagodarenko
> ([email protected]). With that patch, I ran following
> command with and without optimizations.
>
> dd if=/dev/zero of=/mnt/file bs=2M count=2 conv=fsync
>
> Without optimizations:
>
> useless_c0_loops: 3
> useless_c1_loops: 39
> useless_c2_loops: 0
> useless_c3_loops: 0
>
> With optimizations:
>
> useless_c0_loops: 0
> useless_c1_loops: 0
> useless_c2_loops: 0
> useless_c3_loops: 0
>
> This shows that CR0 and CR1 optimizations get rid of useless CR0 and
> CR1 loops altogether thereby significantly reducing the number of
> groups that get considered.
>
> Changes from V4:
> ----------------
> - Only minor fixes, no significant changes
>
> Harshad Shirwadkar (6):
> ext4: drop s_mb_bal_lock and convert protected fields to atomic
> ext4: add ability to return parsed options from parse_options
> ext4: add mballoc stats proc file
> ext4: add MB_NUM_ORDERS macro
> ext4: improve cr 0 / cr 1 group scanning
> ext4: add proc files to monitor new structures
>
> fs/ext4/ext4.h | 30 ++-
> fs/ext4/mballoc.c | 572 +++++++++++++++++++++++++++++++++++++++++++---
> fs/ext4/mballoc.h | 22 +-
> fs/ext4/super.c | 79 +++++--
> fs/ext4/sysfs.c | 6 +
> 5 files changed, 652 insertions(+), 57 deletions(-)
>
Completed my review of this patch series.
Apart from the issue I mentioned in patch-5 of this v5 series. The rest
of the patches looks fine to me.
Please feel free to add:
Reviewed-by: Ritesh Harjani <[email protected]>
On Thu, Mar 25, 2021 at 6:50 AM Ritesh Harjani <[email protected]> wrote:
>
>
>
> On 3/25/21 4:49 AM, Harshad Shirwadkar wrote:
> > Instead of traversing through groups linearly, scan groups in specific
> > orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> > largest free order >= the order of the request. So, with this patch,
> > we maintain lists for each possible order and insert each group into a
> > list based on the largest free order in its buddy bitmap. During cr 0
> > allocation, we traverse these lists in the increasing order of largest
> > free orders. This allows us to find a group with the best available cr
> > 0 match in constant time. If nothing can be found, we fallback to cr 1
> > immediately.
> >
> > At CR1, the story is slightly different. We want to traverse in the
> > order of increasing average fragment size. For CR1, we maintain a rb
> > tree of groupinfos which is sorted by average fragment size. Instead
> > of traversing linearly, at CR1, we traverse in the order of increasing
> > average fragment size, starting at the most optimal group. This brings
> > down cr 1 search complexity to log(num groups).
> >
> > For cr >= 2, we just perform the linear search as before. Also, in
> > case of lock contention, we intermittently fallback to linear search
> > even in CR 0 and CR 1 cases. This allows us to proceed during the
> > allocation path even in case of high contention.
> >
> > There is an opportunity to do optimization at CR2 too. That's because
> > at CR2 we only consider groups where bb_free counter (number of free
> > blocks) is greater than the request extent size. That's left as future
> > work.
> >
> > All the changes introduced in this patch are protected under a new
> > mount option "mb_optimize_scan".
> >
> > With this patchset, following experiment was performed:
> >
> > Created a highly fragmented disk of size 65TB. The disk had no
> > contiguous 2M regions. Following command was run consecutively for 3
> > times:
> >
> > time dd if=/dev/urandom of=file bs=2M count=10
> >
> > Here are the results with and without cr 0/1 optimizations introduced
> > in this patch:
> >
> > |---------+------------------------------+---------------------------|
> > | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> > |---------+------------------------------+---------------------------|
> > | 1st run | 5m1.871s | 2m47.642s |
> > | 2nd run | 2m28.390s | 0m0.611s |
> > | 3rd run | 2m26.530s | 0m1.255s |
> > |---------+------------------------------+---------------------------|
> >
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > Reported-by: kernel test robot <[email protected]>
> > Reported-by: Dan Carpenter <[email protected]>
> > Reviewed-by: Andreas Dilger <[email protected]>
> > ---
> > fs/ext4/ext4.h | 19 ++-
> > fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
> > fs/ext4/mballoc.h | 17 ++-
> > fs/ext4/super.c | 28 +++-
> > fs/ext4/sysfs.c | 2 +
> > 5 files changed, 432 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 85eeeba3bca3..5930c8cb5159 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
> > #define EXT4_MB_USE_RESERVED 0x2000
> > /* Do strict check for free blocks while retrying block allocation */
> > #define EXT4_MB_STRICT_CHECK 0x4000
> > -
> > +/* Large fragment size list lookup succeeded at least once for cr = 0 */
> > +#define EXT4_MB_CR0_OPTIMIZED 0x8000
> > +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
> > +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
> > struct ext4_allocation_request {
> > /* target inode for block we're allocating */
> > struct inode *inode;
> > @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
> > #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
> > #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
> > #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
> > -
> > +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
> > + * scanning in mballoc
> > + */
> >
> > #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> > ~EXT4_MOUNT_##opt
> > @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
> > unsigned int s_mb_free_pending;
> > struct list_head s_freed_data_list; /* List of blocks to be freed
> > after commit completed */
> > + struct rb_root s_mb_avg_fragment_size_root;
> > + rwlock_t s_mb_rb_lock;
> > + struct list_head *s_mb_largest_free_orders;
> > + rwlock_t *s_mb_largest_free_orders_locks;
> >
> > /* tunables */
> > unsigned long s_stripe;
> > + unsigned int s_mb_linear_limit;
> > unsigned int s_mb_stream_request;
> > unsigned int s_mb_max_to_scan;
> > unsigned int s_mb_min_to_scan;
> > @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
> > atomic_t s_bal_goals; /* goal hits */
> > atomic_t s_bal_breaks; /* too long searches */
> > atomic_t s_bal_2orders; /* 2^order hits */
> > + atomic_t s_bal_cr0_bad_suggestions;
> > + atomic_t s_bal_cr1_bad_suggestions;
> > atomic64_t s_bal_cX_groups_considered[4];
> > atomic64_t s_bal_cX_hits[4];
> > atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
> > @@ -3309,11 +3321,14 @@ struct ext4_group_info {
> > ext4_grpblk_t bb_free; /* total free blocks */
> > ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> > ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
> > + ext4_group_t bb_group; /* Group number */
> > struct list_head bb_prealloc_list;
> > #ifdef DOUBLE_CHECK
> > void *bb_bitmap;
> > #endif
> > struct rw_semaphore alloc_sem;
> > + struct rb_node bb_avg_fragment_size_rb;
> > + struct list_head bb_largest_free_order_node;
> > ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
> > * regions, index is order.
> > * bb_counters[3] = 5 means
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 15127d815461..cbf9a89c0ef5 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -127,11 +127,50 @@
> > * smallest multiple of the stripe value (sbi->s_stripe) which is
> > * greater than the default mb_group_prealloc.
> > *
> > + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
> > + * structures in two data structures:
> > + *
> > + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
> > + *
> > + * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
> > + *
> > + * This is an array of lists where the index in the array represents the
> > + * largest free order in the buddy bitmap of the participating group infos of
> > + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
> > + * number of buddy bitmap orders possible) number of lists. Group-infos are
> > + * placed in appropriate lists.
> > + *
> > + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
> > + *
> > + * Locking: sbi->s_mb_rb_lock (rwlock)
> > + *
> > + * This is a red black tree consisting of group infos and the tree is sorted
> > + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
> > + * / ext4_group_info->bb_fragments).
> > + *
> > + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
> > + * structures to decide the order in which groups are to be traversed for
> > + * fulfilling an allocation request.
> > + *
> > + * At CR = 0, we look for groups which have the largest_free_order >= the order
> > + * of the request. We directly look at the largest free order list in the data
> > + * structure (1) above where largest_free_order = order of the request. If that
> > + * list is empty, we look at remaining list in the increasing order of
> > + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
> > + *
> > + * At CR = 1, we only consider groups where average fragment size > request
> > + * size. So, we lookup a group which has average fragment size just above or
> > + * equal to request size using our rb tree (data structure 2) in O(log N) time.
> > + *
> > + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
> > + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
> > + *
> > * The regular allocator (using the buddy cache) supports a few tunables.
> > *
> > * /sys/fs/ext4/<partition>/mb_min_to_scan
> > * /sys/fs/ext4/<partition>/mb_max_to_scan
> > * /sys/fs/ext4/<partition>/mb_order2_req
> > + * /sys/fs/ext4/<partition>/mb_linear_limit
> > *
> > * The regular allocator uses buddy scan only if the request len is power of
> > * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
> > @@ -149,6 +188,16 @@
> > * can be used for allocation. ext4_mb_good_group explains how the groups are
> > * checked.
> > *
> > + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
> > + * get traversed linearly. That may result in subsequent allocations being not
> > + * close to each other. And so, the underlying device may get filled up in a
> > + * non-linear fashion. While that may not matter on non-rotational devices, for
> > + * rotational devices that may result in higher seek times. "mb_linear_limit"
> > + * tells mballoc how many groups mballoc should search linearly before
> > + * performing consulting above data structures for more efficient lookups. For
> > + * non rotational devices, this value defaults to 0 and for rotational devices
> > + * this is set to MB_DEFAULT_LINEAR_LIMIT.
> > + *
> > * Both the prealloc space are getting populated as above. So for the first
> > * request we will hit the buddy cache which will result in this prealloc
> > * space getting filled. The prealloc space is then later used for the
> > @@ -299,6 +348,8 @@
> > * - bitlock on a group (group)
> > * - object (inode/locality) (object)
> > * - per-pa lock (pa)
> > + * - cr0 lists lock (cr0)
> > + * - cr1 tree lock (cr1)
> > *
> > * Paths:
> > * - new pa
> > @@ -328,6 +379,9 @@
> > * group
> > * object
> > *
> > + * - allocation path (ext4_mb_regular_allocator)
> > + * group
> > + * cr0/cr1
> > */
> > static struct kmem_cache *ext4_pspace_cachep;
> > static struct kmem_cache *ext4_ac_cachep;
> > @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
> > ext4_group_t group);
> > static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
> >
> > +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
> > + ext4_group_t group, int cr);
> > +
> > /*
> > * The algorithm using this percpu seq counter goes below:
> > * 1. We sample the percpu discard_pa_seq counter before trying for block
> > @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> > }
> > }
> >
> > +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> > + int (*cmp)(struct rb_node *, struct rb_node *))
> > +{
> > + struct rb_node **iter = &root->rb_node, *parent = NULL;
> > +
> > + while (*iter) {
> > + parent = *iter;
> > + if (cmp(new, *iter) > 0)
> > + iter = &((*iter)->rb_left);
> > + else
> > + iter = &((*iter)->rb_right);
> > + }
> > +
> > + rb_link_node(new, parent, iter);
> > + rb_insert_color(new, root);
> > +}
> > +
> > +static int
> > +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> > +{
> > + struct ext4_group_info *grp1 = rb_entry(rb1,
> > + struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + struct ext4_group_info *grp2 = rb_entry(rb2,
> > + struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + int num_frags_1, num_frags_2;
> > +
> > + num_frags_1 = grp1->bb_fragments ?
> > + grp1->bb_free / grp1->bb_fragments : 0;
> > + num_frags_2 = grp2->bb_fragments ?
> > + grp2->bb_free / grp2->bb_fragments : 0;
> > +
> > + return (num_frags_2 - num_frags_1);
> > +}
> > +
> > +/*
> > + * Reinsert grpinfo into the avg_fragment_size tree with new average
> > + * fragment size.
> > + */
> > +static void
> > +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +
> > + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> > + return;
> > +
> > + write_lock(&sbi->s_mb_rb_lock);
> > + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> > + rb_erase(&grp->bb_avg_fragment_size_rb,
> > + &sbi->s_mb_avg_fragment_size_root);
> > + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> > + }
> > +
> > + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
> > + &grp->bb_avg_fragment_size_rb,
> > + ext4_mb_avg_fragment_size_cmp);
> > + write_unlock(&sbi->s_mb_rb_lock);
> > +}
> > +
> > +/*
> > + * Choose next group by traversing largest_free_order lists. Return 0 if next
> > + * group was selected optimally. Return 1 if next group was not selected
> > + * optimally. Updates *new_cr if cr level needs an update.
> > + */
> > +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + struct ext4_group_info *iter, *grp;
> > + int i;
> > +
> > + if (ac->ac_status == AC_STATUS_FOUND)
> > + return 1;
> > +
> > + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> > + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> > +
> > + grp = NULL;
> > + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> > + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> > + continue;
> > + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> > + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> > + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> > + continue;
> > + }
> > + grp = NULL;
> > + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> > + bb_largest_free_order_node) {
> > + if (sbi->s_mb_stats)
> > + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> > + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> > + grp = iter;
> > + break;
> > + }
> > + }
> > + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> > + if (grp)
> > + break;
> > + }
> > +
> > + if (!grp) {
> > + /* Increment cr and search again */
> > + *new_cr = 1;
> > + } else {
> > + *group = grp->bb_group;
> > + ac->ac_last_optimal_group = *group;
> > + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
> > + }
> > + return 0;
> > +}
> > +
> > +/*
> > + * Choose next group by traversing average fragment size tree. Return 0 if next
> > + * group was selected optimally. Return 1 if next group could not selected
> > + * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
> > + * update.
> > + */
> > +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + int avg_fragment_size, best_so_far;
> > + struct rb_node *node, *found;
> > + struct ext4_group_info *grp;
> > +
> > + /*
> > + * If there is contention on the lock, instead of waiting for the lock
> > + * to become available, just continue searching lineraly. We'll resume
> > + * our rb tree search later starting at ac->ac_last_optimal_group.
> > + */
> > + if (!read_trylock(&sbi->s_mb_rb_lock))
> > + return 1;
> > +
> > + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> > + if (sbi->s_mb_stats)
> > + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> > + /* We have found something at CR 1 in the past */
> > + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
> > + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
> > + found = rb_next(found)) {
> > + grp = rb_entry(found, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + if (sbi->s_mb_stats)
> > + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> > + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
> > + break;
> > + }
> > +
> > + goto done;
> > + }
> > +
> > + node = sbi->s_mb_avg_fragment_size_root.rb_node;
> > + best_so_far = 0;
> > + found = NULL;
> > +
> > + while (node) {
> > + grp = rb_entry(node, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + avg_fragment_size = 0;
> > + /*
> > + * Perform this check without locking, we'll lock later to confirm.
> > + */
> > + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
> > + avg_fragment_size = grp->bb_fragments ?
> > + grp->bb_free / grp->bb_fragments : 0;
> > + if (!best_so_far || avg_fragment_size < best_so_far) {
> > + best_so_far = avg_fragment_size;
> > + found = node;
> > + }
> > + }
> > + if (avg_fragment_size > ac->ac_g_ex.fe_len)
> > + node = node->rb_right;
> > + else
> > + node = node->rb_left;
> > + }
> > +
> > +done:
> > + if (found) {
> > + grp = rb_entry(found, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + *group = grp->bb_group;
> > + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> > + } else {
> > + *new_cr = 2;
> > + }
> > +
> > + read_unlock(&sbi->s_mb_rb_lock);
> > + ac->ac_last_optimal_group = *group;
> > + return 0;
> > +}
> > +
> > +/*
> > + * ext4_mb_choose_next_group: choose next group for allocation.
> > + *
> > + * @ac Allocation Context
> > + * @new_cr This is an output parameter. If the there is no good group
> > + * available at current CR level, this field is updated to indicate
> > + * the new cr level that should be used.
> > + * @group This is an input / output parameter. As an input it indicates the
> > + * last group used for allocation. As output, this field indicates
> > + * the next group that should be used.
> > + * @ngroups Total number of groups
> > + */
> > +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + int ret;
> > +
> > + *new_cr = ac->ac_criteria;
> > +
> > + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> > + *new_cr >= 2 ||
> > + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> > + goto inc_and_return;
> > +
> > + if (ac->ac_groups_linear_remaining) {
> > + ac->ac_groups_linear_remaining--;
> > + goto inc_and_return;
> > + }
> > +
> > + if (*new_cr == 0) {
> > + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> > + if (ret)
> > + goto inc_and_return;
> > + }
> > + if (*new_cr == 1) {
> > + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> > + if (ret)
> > + goto inc_and_return;
> > + }
> > + return;
> > +
> > +inc_and_return:
> > + /*
> > + * Artificially restricted ngroups for non-extent
> > + * files makes group > ngroups possible on first loop.
> > + */
> > + *group = *group + 1;
>
> hmm, Please help me correct if I am missing something here.
> So, if *new_cr >= 2, then we will directly come here
> increment the group and return to calling function.
> But then we ended up incrementing the group while not
> even searching in the given group to see if we could
> find some blocks.
>
>
>
> > + if (*group >= ngroups)
> > + *group = 0;
> > +}
> > +
> > /*
> > * Cache the order of the largest free extent we have available in this block
> > * group.
> > @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> > static void
> > mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> > {
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > int i;
> > - int bits;
> >
> > + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> > + write_lock(&sbi->s_mb_largest_free_orders_locks[
> > + grp->bb_largest_free_order]);
> > + list_del_init(&grp->bb_largest_free_order_node);
> > + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> > + grp->bb_largest_free_order]);
> > + }
> > grp->bb_largest_free_order = -1; /* uninit */
> >
> > - bits = MB_NUM_ORDERS(sb) - 1;
> > - for (i = bits; i >= 0; i--) {
> > + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> > if (grp->bb_counters[i] > 0) {
> > grp->bb_largest_free_order = i;
> > break;
> > }
> > }
> > + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
> > + grp->bb_largest_free_order >= 0 && grp->bb_free) {
> > + write_lock(&sbi->s_mb_largest_free_orders_locks[
> > + grp->bb_largest_free_order]);
> > + list_add_tail(&grp->bb_largest_free_order_node,
> > + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> > + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> > + grp->bb_largest_free_order]);
> > + }
> > }
> >
> > static noinline_for_stack
> > @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
> > period = get_cycles() - period;
> > atomic_inc(&sbi->s_mb_buddies_generated);
> > atomic64_add(period, &sbi->s_mb_generation_time);
> > + mb_update_avg_fragment_size(sb, grp);
> > }
> >
> > /* The buddy information is attached the buddy cache inode
> > @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> >
> > done:
> > mb_set_largest_free_order(sb, e4b->bd_info);
> > + mb_update_avg_fragment_size(sb, e4b->bd_info);
> > mb_check_buddy(e4b);
> > }
> >
> > @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> > }
> > mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
> >
> > + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
> > ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
> > mb_check_buddy(e4b);
> >
> > @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> > * from the goal value specified
> > */
> > group = ac->ac_g_ex.fe_group;
> > + ac->ac_last_optimal_group = group;
>
> Note we start from a goal group or optimal group.
>
> > + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
> > prefetch_grp = group;
> >
> > - for (i = 0; i < ngroups; group++, i++) {
> > - int ret = 0;
> > + for (i = 0; i < ngroups; i++) {
> > + int ret = 0, new_cr;
> > +
> > cond_resched();
> > - /*
> > - * Artificially restricted ngroups for non-extent
> > - * files makes group > ngroups possible on first loop.
> > - */
> > - if (group >= ngroups)
> > - group = 0;
> > +
> > + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
> > +
> > + if (new_cr != cr) {
> > + cr = new_cr;
> > + goto repeat;
> > + }
>
>
> So ext4_mb_choose_next_group() will do group++ _even_ for cr >=2.
> This would mean that we will never start our search from ac_g_ex.fe_group.
> Why is that? Did I miss anything?
Thanks Ritesh, that's a good catch! So group choosing needs to happen
at the start of the group while increment needs to happen at the end.
I'll fix this in the next version.
- Harshad
>
>
> -ritesh
On Mar 24, 2021, at 5:19 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> Instead of traversing through groups linearly, scan groups in specific
> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> largest free order >= the order of the request. So, with this patch,
> we maintain lists for each possible order and insert each group into a
> list based on the largest free order in its buddy bitmap. During cr 0
> allocation, we traverse these lists in the increasing order of largest
> free orders. This allows us to find a group with the best available cr
> 0 match in constant time. If nothing can be found, we fallback to cr 1
> immediately.
>
> At CR1, the story is slightly different. We want to traverse in the
> order of increasing average fragment size. For CR1, we maintain a rb
> tree of groupinfos which is sorted by average fragment size. Instead
> of traversing linearly, at CR1, we traverse in the order of increasing
> average fragment size, starting at the most optimal group. This brings
> down cr 1 search complexity to log(num groups).
>
> For cr >= 2, we just perform the linear search as before. Also, in
> case of lock contention, we intermittently fallback to linear search
> even in CR 0 and CR 1 cases. This allows us to proceed during the
> allocation path even in case of high contention.
If you are refreshing this patch anyway, could you rename "mb_linear_limit"
to "mb_linear_groups" or "mb_max_linear_groups" or similar? It otherwise
isn't clear what units the "limit" is in (could be MB/GB or something else).
Cheers, Andreas
> There is an opportunity to do optimization at CR2 too. That's because
> at CR2 we only consider groups where bb_free counter (number of free
> blocks) is greater than the request extent size. That's left as future
> work.
>
> All the changes introduced in this patch are protected under a new
> mount option "mb_optimize_scan".
>
> With this patchset, following experiment was performed:
>
> Created a highly fragmented disk of size 65TB. The disk had no
> contiguous 2M regions. Following command was run consecutively for 3
> times:
>
> time dd if=/dev/urandom of=file bs=2M count=10
>
> Here are the results with and without cr 0/1 optimizations introduced
> in this patch:
>
> |---------+------------------------------+---------------------------|
> | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> |---------+------------------------------+---------------------------|
> | 1st run | 5m1.871s | 2m47.642s |
> | 2nd run | 2m28.390s | 0m0.611s |
> | 3rd run | 2m26.530s | 0m1.255s |
> |---------+------------------------------+---------------------------|
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> Reported-by: Dan Carpenter <[email protected]>
> Reviewed-by: Andreas Dilger <[email protected]>
> ---
> fs/ext4/ext4.h | 19 ++-
> fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
> fs/ext4/mballoc.h | 17 ++-
> fs/ext4/super.c | 28 +++-
> fs/ext4/sysfs.c | 2 +
> 5 files changed, 432 insertions(+), 15 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 85eeeba3bca3..5930c8cb5159 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
> #define EXT4_MB_USE_RESERVED 0x2000
> /* Do strict check for free blocks while retrying block allocation */
> #define EXT4_MB_STRICT_CHECK 0x4000
> -
> +/* Large fragment size list lookup succeeded at least once for cr = 0 */
> +#define EXT4_MB_CR0_OPTIMIZED 0x8000
> +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
> +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
> struct ext4_allocation_request {
> /* target inode for block we're allocating */
> struct inode *inode;
> @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
> #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
> #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
> #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
> -
> +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
> + * scanning in mballoc
> + */
>
> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> ~EXT4_MOUNT_##opt
> @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
> unsigned int s_mb_free_pending;
> struct list_head s_freed_data_list; /* List of blocks to be freed
> after commit completed */
> + struct rb_root s_mb_avg_fragment_size_root;
> + rwlock_t s_mb_rb_lock;
> + struct list_head *s_mb_largest_free_orders;
> + rwlock_t *s_mb_largest_free_orders_locks;
>
> /* tunables */
> unsigned long s_stripe;
> + unsigned int s_mb_linear_limit;
> unsigned int s_mb_stream_request;
> unsigned int s_mb_max_to_scan;
> unsigned int s_mb_min_to_scan;
> @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
> atomic_t s_bal_goals; /* goal hits */
> atomic_t s_bal_breaks; /* too long searches */
> atomic_t s_bal_2orders; /* 2^order hits */
> + atomic_t s_bal_cr0_bad_suggestions;
> + atomic_t s_bal_cr1_bad_suggestions;
> atomic64_t s_bal_cX_groups_considered[4];
> atomic64_t s_bal_cX_hits[4];
> atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
> @@ -3309,11 +3321,14 @@ struct ext4_group_info {
> ext4_grpblk_t bb_free; /* total free blocks */
> ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
> + ext4_group_t bb_group; /* Group number */
> struct list_head bb_prealloc_list;
> #ifdef DOUBLE_CHECK
> void *bb_bitmap;
> #endif
> struct rw_semaphore alloc_sem;
> + struct rb_node bb_avg_fragment_size_rb;
> + struct list_head bb_largest_free_order_node;
> ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
> * regions, index is order.
> * bb_counters[3] = 5 means
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 15127d815461..cbf9a89c0ef5 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -127,11 +127,50 @@
> * smallest multiple of the stripe value (sbi->s_stripe) which is
> * greater than the default mb_group_prealloc.
> *
> + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
> + * structures in two data structures:
> + *
> + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
> + *
> + * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
> + *
> + * This is an array of lists where the index in the array represents the
> + * largest free order in the buddy bitmap of the participating group infos of
> + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
> + * number of buddy bitmap orders possible) number of lists. Group-infos are
> + * placed in appropriate lists.
> + *
> + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
> + *
> + * Locking: sbi->s_mb_rb_lock (rwlock)
> + *
> + * This is a red black tree consisting of group infos and the tree is sorted
> + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
> + * / ext4_group_info->bb_fragments).
> + *
> + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
> + * structures to decide the order in which groups are to be traversed for
> + * fulfilling an allocation request.
> + *
> + * At CR = 0, we look for groups which have the largest_free_order >= the order
> + * of the request. We directly look at the largest free order list in the data
> + * structure (1) above where largest_free_order = order of the request. If that
> + * list is empty, we look at remaining list in the increasing order of
> + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
> + *
> + * At CR = 1, we only consider groups where average fragment size > request
> + * size. So, we lookup a group which has average fragment size just above or
> + * equal to request size using our rb tree (data structure 2) in O(log N) time.
> + *
> + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
> + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
> + *
> * The regular allocator (using the buddy cache) supports a few tunables.
> *
> * /sys/fs/ext4/<partition>/mb_min_to_scan
> * /sys/fs/ext4/<partition>/mb_max_to_scan
> * /sys/fs/ext4/<partition>/mb_order2_req
> + * /sys/fs/ext4/<partition>/mb_linear_limit
> *
> * The regular allocator uses buddy scan only if the request len is power of
> * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
> @@ -149,6 +188,16 @@
> * can be used for allocation. ext4_mb_good_group explains how the groups are
> * checked.
> *
> + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
> + * get traversed linearly. That may result in subsequent allocations being not
> + * close to each other. And so, the underlying device may get filled up in a
> + * non-linear fashion. While that may not matter on non-rotational devices, for
> + * rotational devices that may result in higher seek times. "mb_linear_limit"
> + * tells mballoc how many groups mballoc should search linearly before
> + * performing consulting above data structures for more efficient lookups. For
> + * non rotational devices, this value defaults to 0 and for rotational devices
> + * this is set to MB_DEFAULT_LINEAR_LIMIT.
> + *
> * Both the prealloc space are getting populated as above. So for the first
> * request we will hit the buddy cache which will result in this prealloc
> * space getting filled. The prealloc space is then later used for the
> @@ -299,6 +348,8 @@
> * - bitlock on a group (group)
> * - object (inode/locality) (object)
> * - per-pa lock (pa)
> + * - cr0 lists lock (cr0)
> + * - cr1 tree lock (cr1)
> *
> * Paths:
> * - new pa
> @@ -328,6 +379,9 @@
> * group
> * object
> *
> + * - allocation path (ext4_mb_regular_allocator)
> + * group
> + * cr0/cr1
> */
> static struct kmem_cache *ext4_pspace_cachep;
> static struct kmem_cache *ext4_ac_cachep;
> @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
> ext4_group_t group);
> static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
>
> +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
> + ext4_group_t group, int cr);
> +
> /*
> * The algorithm using this percpu seq counter goes below:
> * 1. We sample the percpu discard_pa_seq counter before trying for block
> @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> }
> }
>
> +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> + int (*cmp)(struct rb_node *, struct rb_node *))
> +{
> + struct rb_node **iter = &root->rb_node, *parent = NULL;
> +
> + while (*iter) {
> + parent = *iter;
> + if (cmp(new, *iter) > 0)
> + iter = &((*iter)->rb_left);
> + else
> + iter = &((*iter)->rb_right);
> + }
> +
> + rb_link_node(new, parent, iter);
> + rb_insert_color(new, root);
> +}
> +
> +static int
> +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> +{
> + struct ext4_group_info *grp1 = rb_entry(rb1,
> + struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + struct ext4_group_info *grp2 = rb_entry(rb2,
> + struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + int num_frags_1, num_frags_2;
> +
> + num_frags_1 = grp1->bb_fragments ?
> + grp1->bb_free / grp1->bb_fragments : 0;
> + num_frags_2 = grp2->bb_fragments ?
> + grp2->bb_free / grp2->bb_fragments : 0;
> +
> + return (num_frags_2 - num_frags_1);
> +}
> +
> +/*
> + * Reinsert grpinfo into the avg_fragment_size tree with new average
> + * fragment size.
> + */
> +static void
> +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> + return;
> +
> + write_lock(&sbi->s_mb_rb_lock);
> + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> + rb_erase(&grp->bb_avg_fragment_size_rb,
> + &sbi->s_mb_avg_fragment_size_root);
> + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> + }
> +
> + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
> + &grp->bb_avg_fragment_size_rb,
> + ext4_mb_avg_fragment_size_cmp);
> + write_unlock(&sbi->s_mb_rb_lock);
> +}
> +
> +/*
> + * Choose next group by traversing largest_free_order lists. Return 0 if next
> + * group was selected optimally. Return 1 if next group was not selected
> + * optimally. Updates *new_cr if cr level needs an update.
> + */
> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_group_info *iter, *grp;
> + int i;
> +
> + if (ac->ac_status == AC_STATUS_FOUND)
> + return 1;
> +
> + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> +
> + grp = NULL;
> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> + continue;
> + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> + continue;
> + }
> + grp = NULL;
> + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> + bb_largest_free_order_node) {
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> + grp = iter;
> + break;
> + }
> + }
> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> + if (grp)
> + break;
> + }
> +
> + if (!grp) {
> + /* Increment cr and search again */
> + *new_cr = 1;
> + } else {
> + *group = grp->bb_group;
> + ac->ac_last_optimal_group = *group;
> + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
> + }
> + return 0;
> +}
> +
> +/*
> + * Choose next group by traversing average fragment size tree. Return 0 if next
> + * group was selected optimally. Return 1 if next group could not selected
> + * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
> + * update.
> + */
> +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + int avg_fragment_size, best_so_far;
> + struct rb_node *node, *found;
> + struct ext4_group_info *grp;
> +
> + /*
> + * If there is contention on the lock, instead of waiting for the lock
> + * to become available, just continue searching lineraly. We'll resume
> + * our rb tree search later starting at ac->ac_last_optimal_group.
> + */
> + if (!read_trylock(&sbi->s_mb_rb_lock))
> + return 1;
> +
> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> + if (sbi->s_mb_stats)
> + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> + /* We have found something at CR 1 in the past */
> + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
> + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
> + found = rb_next(found)) {
> + grp = rb_entry(found, struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
> + break;
> + }
> +
> + goto done;
> + }
> +
> + node = sbi->s_mb_avg_fragment_size_root.rb_node;
> + best_so_far = 0;
> + found = NULL;
> +
> + while (node) {
> + grp = rb_entry(node, struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + avg_fragment_size = 0;
> + /*
> + * Perform this check without locking, we'll lock later to confirm.
> + */
> + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
> + avg_fragment_size = grp->bb_fragments ?
> + grp->bb_free / grp->bb_fragments : 0;
> + if (!best_so_far || avg_fragment_size < best_so_far) {
> + best_so_far = avg_fragment_size;
> + found = node;
> + }
> + }
> + if (avg_fragment_size > ac->ac_g_ex.fe_len)
> + node = node->rb_right;
> + else
> + node = node->rb_left;
> + }
> +
> +done:
> + if (found) {
> + grp = rb_entry(found, struct ext4_group_info,
> + bb_avg_fragment_size_rb);
> + *group = grp->bb_group;
> + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> + } else {
> + *new_cr = 2;
> + }
> +
> + read_unlock(&sbi->s_mb_rb_lock);
> + ac->ac_last_optimal_group = *group;
> + return 0;
> +}
> +
> +/*
> + * ext4_mb_choose_next_group: choose next group for allocation.
> + *
> + * @ac Allocation Context
> + * @new_cr This is an output parameter. If the there is no good group
> + * available at current CR level, this field is updated to indicate
> + * the new cr level that should be used.
> + * @group This is an input / output parameter. As an input it indicates the
> + * last group used for allocation. As output, this field indicates
> + * the next group that should be used.
> + * @ngroups Total number of groups
> + */
> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> + int ret;
> +
> + *new_cr = ac->ac_criteria;
> +
> + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> + *new_cr >= 2 ||
> + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> + goto inc_and_return;
> +
> + if (ac->ac_groups_linear_remaining) {
> + ac->ac_groups_linear_remaining--;
> + goto inc_and_return;
> + }
> +
> + if (*new_cr == 0) {
> + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> + if (ret)
> + goto inc_and_return;
> + }
> + if (*new_cr == 1) {
> + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> + if (ret)
> + goto inc_and_return;
> + }
> + return;
> +
> +inc_and_return:
> + /*
> + * Artificially restricted ngroups for non-extent
> + * files makes group > ngroups possible on first loop.
> + */
> + *group = *group + 1;
> + if (*group >= ngroups)
> + *group = 0;
> +}
> +
> /*
> * Cache the order of the largest free extent we have available in this block
> * group.
> @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> static void
> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> {
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> int i;
> - int bits;
>
> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> + write_lock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + list_del_init(&grp->bb_largest_free_order_node);
> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + }
> grp->bb_largest_free_order = -1; /* uninit */
>
> - bits = MB_NUM_ORDERS(sb) - 1;
> - for (i = bits; i >= 0; i--) {
> + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> if (grp->bb_counters[i] > 0) {
> grp->bb_largest_free_order = i;
> break;
> }
> }
> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
> + grp->bb_largest_free_order >= 0 && grp->bb_free) {
> + write_lock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + list_add_tail(&grp->bb_largest_free_order_node,
> + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + }
> }
>
> static noinline_for_stack
> @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
> period = get_cycles() - period;
> atomic_inc(&sbi->s_mb_buddies_generated);
> atomic64_add(period, &sbi->s_mb_generation_time);
> + mb_update_avg_fragment_size(sb, grp);
> }
>
> /* The buddy information is attached the buddy cache inode
> @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
>
> done:
> mb_set_largest_free_order(sb, e4b->bd_info);
> + mb_update_avg_fragment_size(sb, e4b->bd_info);
> mb_check_buddy(e4b);
> }
>
> @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> }
> mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
>
> + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
> ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
> mb_check_buddy(e4b);
>
> @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> * from the goal value specified
> */
> group = ac->ac_g_ex.fe_group;
> + ac->ac_last_optimal_group = group;
> + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
> prefetch_grp = group;
>
> - for (i = 0; i < ngroups; group++, i++) {
> - int ret = 0;
> + for (i = 0; i < ngroups; i++) {
> + int ret = 0, new_cr;
> +
> cond_resched();
> - /*
> - * Artificially restricted ngroups for non-extent
> - * files makes group > ngroups possible on first loop.
> - */
> - if (group >= ngroups)
> - group = 0;
> +
> + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
> +
> + if (new_cr != cr) {
> + cr = new_cr;
> + goto repeat;
> + }
>
> /*
> * Batch reads of the block allocation bitmaps
> @@ -2578,6 +2902,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
> seq_printf(seq, "\t\tuseless_loops: %llu\n",
> atomic64_read(&sbi->s_bal_cX_failed[0]));
> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
> + atomic_read(&sbi->s_bal_cr0_bad_suggestions));
>
> seq_puts(seq, "\tcr1_stats:\n");
> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
> @@ -2585,6 +2911,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
> seq_printf(seq, "\t\tuseless_loops: %llu\n",
> atomic64_read(&sbi->s_bal_cX_failed[1]));
> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
> + atomic_read(&sbi->s_bal_cr1_bad_suggestions));
>
> seq_puts(seq, "\tcr2_stats:\n");
> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
> @@ -2719,7 +3047,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> init_rwsem(&meta_group_info[i]->alloc_sem);
> meta_group_info[i]->bb_free_root = RB_ROOT;
> + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
> meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
> + meta_group_info[i]->bb_group = group;
>
> mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
> return 0;
> @@ -2909,6 +3240,26 @@ int ext4_mb_init(struct super_block *sb)
> i++;
> } while (i < MB_NUM_ORDERS(sb));
>
> + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
> + sbi->s_mb_largest_free_orders =
> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> + GFP_KERNEL);
> + if (!sbi->s_mb_largest_free_orders) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + sbi->s_mb_largest_free_orders_locks =
> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> + GFP_KERNEL);
> + if (!sbi->s_mb_largest_free_orders_locks) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> + rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
> + }
> + rwlock_init(&sbi->s_mb_rb_lock);
>
> spin_lock_init(&sbi->s_md_lock);
> sbi->s_mb_free_pending = 0;
> @@ -2961,6 +3312,10 @@ int ext4_mb_init(struct super_block *sb)
> spin_lock_init(&lg->lg_prealloc_lock);
> }
>
> + if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
> + sbi->s_mb_linear_limit = 0;
> + else
> + sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
> /* init file for buddy data */
> ret = ext4_mb_init_backend(sb);
> if (ret != 0)
> @@ -2972,6 +3327,8 @@ int ext4_mb_init(struct super_block *sb)
> free_percpu(sbi->s_locality_groups);
> sbi->s_locality_groups = NULL;
> out:
> + kfree(sbi->s_mb_largest_free_orders);
> + kfree(sbi->s_mb_largest_free_orders_locks);
> kfree(sbi->s_mb_offsets);
> sbi->s_mb_offsets = NULL;
> kfree(sbi->s_mb_maxs);
> @@ -3028,6 +3385,8 @@ int ext4_mb_release(struct super_block *sb)
> kvfree(group_info);
> rcu_read_unlock();
> }
> + kfree(sbi->s_mb_largest_free_orders);
> + kfree(sbi->s_mb_largest_free_orders_locks);
> kfree(sbi->s_mb_offsets);
> kfree(sbi->s_mb_maxs);
> iput(sbi->s_buddy_cache);
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index 68111a10cfee..02585e3cbcad 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -78,6 +78,18 @@
> */
> #define MB_DEFAULT_MAX_INODE_PREALLOC 512
>
> +/*
> + * Number of groups to search linearly before performing group scanning
> + * optimization.
> + */
> +#define MB_DEFAULT_LINEAR_LIMIT 4
> +
> +/*
> + * Minimum number of groups that should be present in the file system to perform
> + * group scanning optimizations.
> + */
> +#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
> +
> /*
> * Number of valid buddy orders
> */
> @@ -166,11 +178,14 @@ struct ext4_allocation_context {
> /* copy of the best found extent taken before preallocation efforts */
> struct ext4_free_extent ac_f_ex;
>
> + ext4_group_t ac_last_optimal_group;
> + __u32 ac_groups_considered;
> + __u32 ac_flags; /* allocation hints */
> __u16 ac_groups_scanned;
> + __u16 ac_groups_linear_remaining;
> __u16 ac_found;
> __u16 ac_tail;
> __u16 ac_buddy;
> - __u16 ac_flags; /* allocation hints */
> __u8 ac_status;
> __u8 ac_criteria;
> __u8 ac_2order; /* if request is to allocate 2^N blocks and
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3a2cd9fb7e73..6116640081c0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1687,7 +1687,7 @@ enum {
> Opt_dioread_nolock, Opt_dioread_lock,
> Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> - Opt_prefetch_block_bitmaps,
> + Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
> #ifdef CONFIG_EXT4_DEBUG
> Opt_fc_debug_max_replay, Opt_fc_debug_force
> #endif
> @@ -1788,6 +1788,7 @@ static const match_table_t tokens = {
> {Opt_nombcache, "nombcache"},
> {Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
> {Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
> + {Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
> {Opt_removed, "check=none"}, /* mount option from ext2/3 */
> {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
> {Opt_removed, "reservation"}, /* mount option from ext2/3 */
> @@ -1820,6 +1821,8 @@ static ext4_fsblk_t get_sb_block(void **data)
> }
>
> #define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
> +#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
> +
> static const char deprecated_msg[] =
> "Mount option \"%s\" will be removed by %s\n"
> "Contact [email protected] if you think we should keep it.\n";
> @@ -2008,6 +2011,7 @@ static const struct mount_opts {
> {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
> {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
> MOPT_SET},
> + {Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
> #ifdef CONFIG_EXT4_DEBUG
> {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
> MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
> @@ -2092,6 +2096,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
> struct ext4_parsed_options {
> unsigned long journal_devnum;
> unsigned int journal_ioprio;
> + int mb_optimize_scan;
> };
>
> static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> @@ -2388,6 +2393,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> sbi->s_mount_opt |= m->mount_opt;
> } else if (token == Opt_data_err_ignore) {
> sbi->s_mount_opt &= ~m->mount_opt;
> + } else if (token == Opt_mb_optimize_scan) {
> + if (arg != 0 && arg != 1) {
> + ext4_msg(sb, KERN_WARNING,
> + "mb_optimize_scan should be set to 0 or 1.");
> + return -1;
> + }
> + parsed_opts->mb_optimize_scan = arg;
> } else {
> if (!args->from)
> arg = 1;
> @@ -4034,6 +4046,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> /* Set defaults for the variables that will be set during parsing */
> parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> parsed_opts.journal_devnum = 0;
> + parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
>
> if ((data && !orig_data) || !sbi)
> goto out_free_base;
> @@ -4984,6 +4997,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> ext4_fc_replay_cleanup(sb);
>
> ext4_ext_init(sb);
> +
> + /*
> + * Enable optimize_scan if number of groups is > threshold. This can be
> + * turned off by passing "mb_optimize_scan=0". This can also be
> + * turned on forcefully by passing "mb_optimize_scan=1".
> + */
> + if (parsed_opts.mb_optimize_scan == 1)
> + set_opt2(sb, MB_OPTIMIZE_SCAN);
> + else if (parsed_opts.mb_optimize_scan == 0)
> + clear_opt2(sb, MB_OPTIMIZE_SCAN);
> + else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
> + set_opt2(sb, MB_OPTIMIZE_SCAN);
> +
> err = ext4_mb_init(sb);
> if (err) {
> ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index 59ca9d73b42f..16b8a838f631 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -213,6 +213,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
> EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
> +EXT4_RW_ATTR_SBI_UI(mb_linear_limit, s_mb_linear_limit);
> EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
> EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
> EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
> @@ -260,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
> ATTR_LIST(mb_stream_req),
> ATTR_LIST(mb_group_prealloc),
> ATTR_LIST(mb_max_inode_prealloc),
> + ATTR_LIST(mb_linear_limit),
> ATTR_LIST(max_writeback_mb_bump),
> ATTR_LIST(extent_max_zeroout_kb),
> ATTR_LIST(trigger_fs_error),
> --
> 2.31.0.291.g576ba9dcdaf-goog
>
Cheers, Andreas
Okay sounds good. I'll do that!
On Mon, Mar 29, 2021 at 12:14 PM Andreas Dilger <[email protected]> wrote:
>
> On Mar 24, 2021, at 5:19 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > Instead of traversing through groups linearly, scan groups in specific
> > orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> > largest free order >= the order of the request. So, with this patch,
> > we maintain lists for each possible order and insert each group into a
> > list based on the largest free order in its buddy bitmap. During cr 0
> > allocation, we traverse these lists in the increasing order of largest
> > free orders. This allows us to find a group with the best available cr
> > 0 match in constant time. If nothing can be found, we fallback to cr 1
> > immediately.
> >
> > At CR1, the story is slightly different. We want to traverse in the
> > order of increasing average fragment size. For CR1, we maintain a rb
> > tree of groupinfos which is sorted by average fragment size. Instead
> > of traversing linearly, at CR1, we traverse in the order of increasing
> > average fragment size, starting at the most optimal group. This brings
> > down cr 1 search complexity to log(num groups).
> >
> > For cr >= 2, we just perform the linear search as before. Also, in
> > case of lock contention, we intermittently fallback to linear search
> > even in CR 0 and CR 1 cases. This allows us to proceed during the
> > allocation path even in case of high contention.
>
> If you are refreshing this patch anyway, could you rename "mb_linear_limit"
> to "mb_linear_groups" or "mb_max_linear_groups" or similar? It otherwise
> isn't clear what units the "limit" is in (could be MB/GB or something else).
>
> Cheers, Andreas
>
> > There is an opportunity to do optimization at CR2 too. That's because
> > at CR2 we only consider groups where bb_free counter (number of free
> > blocks) is greater than the request extent size. That's left as future
> > work.
> >
> > All the changes introduced in this patch are protected under a new
> > mount option "mb_optimize_scan".
> >
> > With this patchset, following experiment was performed:
> >
> > Created a highly fragmented disk of size 65TB. The disk had no
> > contiguous 2M regions. Following command was run consecutively for 3
> > times:
> >
> > time dd if=/dev/urandom of=file bs=2M count=10
> >
> > Here are the results with and without cr 0/1 optimizations introduced
> > in this patch:
> >
> > |---------+------------------------------+---------------------------|
> > | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> > |---------+------------------------------+---------------------------|
> > | 1st run | 5m1.871s | 2m47.642s |
> > | 2nd run | 2m28.390s | 0m0.611s |
> > | 3rd run | 2m26.530s | 0m1.255s |
> > |---------+------------------------------+---------------------------|
> >
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > Reported-by: kernel test robot <[email protected]>
> > Reported-by: Dan Carpenter <[email protected]>
> > Reviewed-by: Andreas Dilger <[email protected]>
> > ---
> > fs/ext4/ext4.h | 19 ++-
> > fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
> > fs/ext4/mballoc.h | 17 ++-
> > fs/ext4/super.c | 28 +++-
> > fs/ext4/sysfs.c | 2 +
> > 5 files changed, 432 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 85eeeba3bca3..5930c8cb5159 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
> > #define EXT4_MB_USE_RESERVED 0x2000
> > /* Do strict check for free blocks while retrying block allocation */
> > #define EXT4_MB_STRICT_CHECK 0x4000
> > -
> > +/* Large fragment size list lookup succeeded at least once for cr = 0 */
> > +#define EXT4_MB_CR0_OPTIMIZED 0x8000
> > +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
> > +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
> > struct ext4_allocation_request {
> > /* target inode for block we're allocating */
> > struct inode *inode;
> > @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
> > #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
> > #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
> > #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
> > -
> > +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
> > + * scanning in mballoc
> > + */
> >
> > #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> > ~EXT4_MOUNT_##opt
> > @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
> > unsigned int s_mb_free_pending;
> > struct list_head s_freed_data_list; /* List of blocks to be freed
> > after commit completed */
> > + struct rb_root s_mb_avg_fragment_size_root;
> > + rwlock_t s_mb_rb_lock;
> > + struct list_head *s_mb_largest_free_orders;
> > + rwlock_t *s_mb_largest_free_orders_locks;
> >
> > /* tunables */
> > unsigned long s_stripe;
> > + unsigned int s_mb_linear_limit;
> > unsigned int s_mb_stream_request;
> > unsigned int s_mb_max_to_scan;
> > unsigned int s_mb_min_to_scan;
> > @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
> > atomic_t s_bal_goals; /* goal hits */
> > atomic_t s_bal_breaks; /* too long searches */
> > atomic_t s_bal_2orders; /* 2^order hits */
> > + atomic_t s_bal_cr0_bad_suggestions;
> > + atomic_t s_bal_cr1_bad_suggestions;
> > atomic64_t s_bal_cX_groups_considered[4];
> > atomic64_t s_bal_cX_hits[4];
> > atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
> > @@ -3309,11 +3321,14 @@ struct ext4_group_info {
> > ext4_grpblk_t bb_free; /* total free blocks */
> > ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> > ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
> > + ext4_group_t bb_group; /* Group number */
> > struct list_head bb_prealloc_list;
> > #ifdef DOUBLE_CHECK
> > void *bb_bitmap;
> > #endif
> > struct rw_semaphore alloc_sem;
> > + struct rb_node bb_avg_fragment_size_rb;
> > + struct list_head bb_largest_free_order_node;
> > ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
> > * regions, index is order.
> > * bb_counters[3] = 5 means
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 15127d815461..cbf9a89c0ef5 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -127,11 +127,50 @@
> > * smallest multiple of the stripe value (sbi->s_stripe) which is
> > * greater than the default mb_group_prealloc.
> > *
> > + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
> > + * structures in two data structures:
> > + *
> > + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
> > + *
> > + * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
> > + *
> > + * This is an array of lists where the index in the array represents the
> > + * largest free order in the buddy bitmap of the participating group infos of
> > + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
> > + * number of buddy bitmap orders possible) number of lists. Group-infos are
> > + * placed in appropriate lists.
> > + *
> > + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
> > + *
> > + * Locking: sbi->s_mb_rb_lock (rwlock)
> > + *
> > + * This is a red black tree consisting of group infos and the tree is sorted
> > + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
> > + * / ext4_group_info->bb_fragments).
> > + *
> > + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
> > + * structures to decide the order in which groups are to be traversed for
> > + * fulfilling an allocation request.
> > + *
> > + * At CR = 0, we look for groups which have the largest_free_order >= the order
> > + * of the request. We directly look at the largest free order list in the data
> > + * structure (1) above where largest_free_order = order of the request. If that
> > + * list is empty, we look at remaining list in the increasing order of
> > + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
> > + *
> > + * At CR = 1, we only consider groups where average fragment size > request
> > + * size. So, we lookup a group which has average fragment size just above or
> > + * equal to request size using our rb tree (data structure 2) in O(log N) time.
> > + *
> > + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
> > + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
> > + *
> > * The regular allocator (using the buddy cache) supports a few tunables.
> > *
> > * /sys/fs/ext4/<partition>/mb_min_to_scan
> > * /sys/fs/ext4/<partition>/mb_max_to_scan
> > * /sys/fs/ext4/<partition>/mb_order2_req
> > + * /sys/fs/ext4/<partition>/mb_linear_limit
> > *
> > * The regular allocator uses buddy scan only if the request len is power of
> > * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
> > @@ -149,6 +188,16 @@
> > * can be used for allocation. ext4_mb_good_group explains how the groups are
> > * checked.
> > *
> > + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
> > + * get traversed linearly. That may result in subsequent allocations being not
> > + * close to each other. And so, the underlying device may get filled up in a
> > + * non-linear fashion. While that may not matter on non-rotational devices, for
> > + * rotational devices that may result in higher seek times. "mb_linear_limit"
> > + * tells mballoc how many groups mballoc should search linearly before
> > + * performing consulting above data structures for more efficient lookups. For
> > + * non rotational devices, this value defaults to 0 and for rotational devices
> > + * this is set to MB_DEFAULT_LINEAR_LIMIT.
> > + *
> > * Both the prealloc space are getting populated as above. So for the first
> > * request we will hit the buddy cache which will result in this prealloc
> > * space getting filled. The prealloc space is then later used for the
> > @@ -299,6 +348,8 @@
> > * - bitlock on a group (group)
> > * - object (inode/locality) (object)
> > * - per-pa lock (pa)
> > + * - cr0 lists lock (cr0)
> > + * - cr1 tree lock (cr1)
> > *
> > * Paths:
> > * - new pa
> > @@ -328,6 +379,9 @@
> > * group
> > * object
> > *
> > + * - allocation path (ext4_mb_regular_allocator)
> > + * group
> > + * cr0/cr1
> > */
> > static struct kmem_cache *ext4_pspace_cachep;
> > static struct kmem_cache *ext4_ac_cachep;
> > @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
> > ext4_group_t group);
> > static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
> >
> > +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
> > + ext4_group_t group, int cr);
> > +
> > /*
> > * The algorithm using this percpu seq counter goes below:
> > * 1. We sample the percpu discard_pa_seq counter before trying for block
> > @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> > }
> > }
> >
> > +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> > + int (*cmp)(struct rb_node *, struct rb_node *))
> > +{
> > + struct rb_node **iter = &root->rb_node, *parent = NULL;
> > +
> > + while (*iter) {
> > + parent = *iter;
> > + if (cmp(new, *iter) > 0)
> > + iter = &((*iter)->rb_left);
> > + else
> > + iter = &((*iter)->rb_right);
> > + }
> > +
> > + rb_link_node(new, parent, iter);
> > + rb_insert_color(new, root);
> > +}
> > +
> > +static int
> > +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> > +{
> > + struct ext4_group_info *grp1 = rb_entry(rb1,
> > + struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + struct ext4_group_info *grp2 = rb_entry(rb2,
> > + struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + int num_frags_1, num_frags_2;
> > +
> > + num_frags_1 = grp1->bb_fragments ?
> > + grp1->bb_free / grp1->bb_fragments : 0;
> > + num_frags_2 = grp2->bb_fragments ?
> > + grp2->bb_free / grp2->bb_fragments : 0;
> > +
> > + return (num_frags_2 - num_frags_1);
> > +}
> > +
> > +/*
> > + * Reinsert grpinfo into the avg_fragment_size tree with new average
> > + * fragment size.
> > + */
> > +static void
> > +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +
> > + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> > + return;
> > +
> > + write_lock(&sbi->s_mb_rb_lock);
> > + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> > + rb_erase(&grp->bb_avg_fragment_size_rb,
> > + &sbi->s_mb_avg_fragment_size_root);
> > + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> > + }
> > +
> > + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
> > + &grp->bb_avg_fragment_size_rb,
> > + ext4_mb_avg_fragment_size_cmp);
> > + write_unlock(&sbi->s_mb_rb_lock);
> > +}
> > +
> > +/*
> > + * Choose next group by traversing largest_free_order lists. Return 0 if next
> > + * group was selected optimally. Return 1 if next group was not selected
> > + * optimally. Updates *new_cr if cr level needs an update.
> > + */
> > +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + struct ext4_group_info *iter, *grp;
> > + int i;
> > +
> > + if (ac->ac_status == AC_STATUS_FOUND)
> > + return 1;
> > +
> > + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> > + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> > +
> > + grp = NULL;
> > + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> > + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> > + continue;
> > + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> > + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> > + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> > + continue;
> > + }
> > + grp = NULL;
> > + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> > + bb_largest_free_order_node) {
> > + if (sbi->s_mb_stats)
> > + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> > + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> > + grp = iter;
> > + break;
> > + }
> > + }
> > + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> > + if (grp)
> > + break;
> > + }
> > +
> > + if (!grp) {
> > + /* Increment cr and search again */
> > + *new_cr = 1;
> > + } else {
> > + *group = grp->bb_group;
> > + ac->ac_last_optimal_group = *group;
> > + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
> > + }
> > + return 0;
> > +}
> > +
> > +/*
> > + * Choose next group by traversing average fragment size tree. Return 0 if next
> > + * group was selected optimally. Return 1 if next group could not selected
> > + * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
> > + * update.
> > + */
> > +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + int avg_fragment_size, best_so_far;
> > + struct rb_node *node, *found;
> > + struct ext4_group_info *grp;
> > +
> > + /*
> > + * If there is contention on the lock, instead of waiting for the lock
> > + * to become available, just continue searching lineraly. We'll resume
> > + * our rb tree search later starting at ac->ac_last_optimal_group.
> > + */
> > + if (!read_trylock(&sbi->s_mb_rb_lock))
> > + return 1;
> > +
> > + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> > + if (sbi->s_mb_stats)
> > + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> > + /* We have found something at CR 1 in the past */
> > + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
> > + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
> > + found = rb_next(found)) {
> > + grp = rb_entry(found, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + if (sbi->s_mb_stats)
> > + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> > + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
> > + break;
> > + }
> > +
> > + goto done;
> > + }
> > +
> > + node = sbi->s_mb_avg_fragment_size_root.rb_node;
> > + best_so_far = 0;
> > + found = NULL;
> > +
> > + while (node) {
> > + grp = rb_entry(node, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + avg_fragment_size = 0;
> > + /*
> > + * Perform this check without locking, we'll lock later to confirm.
> > + */
> > + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
> > + avg_fragment_size = grp->bb_fragments ?
> > + grp->bb_free / grp->bb_fragments : 0;
> > + if (!best_so_far || avg_fragment_size < best_so_far) {
> > + best_so_far = avg_fragment_size;
> > + found = node;
> > + }
> > + }
> > + if (avg_fragment_size > ac->ac_g_ex.fe_len)
> > + node = node->rb_right;
> > + else
> > + node = node->rb_left;
> > + }
> > +
> > +done:
> > + if (found) {
> > + grp = rb_entry(found, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + *group = grp->bb_group;
> > + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> > + } else {
> > + *new_cr = 2;
> > + }
> > +
> > + read_unlock(&sbi->s_mb_rb_lock);
> > + ac->ac_last_optimal_group = *group;
> > + return 0;
> > +}
> > +
> > +/*
> > + * ext4_mb_choose_next_group: choose next group for allocation.
> > + *
> > + * @ac Allocation Context
> > + * @new_cr This is an output parameter. If the there is no good group
> > + * available at current CR level, this field is updated to indicate
> > + * the new cr level that should be used.
> > + * @group This is an input / output parameter. As an input it indicates the
> > + * last group used for allocation. As output, this field indicates
> > + * the next group that should be used.
> > + * @ngroups Total number of groups
> > + */
> > +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + int ret;
> > +
> > + *new_cr = ac->ac_criteria;
> > +
> > + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> > + *new_cr >= 2 ||
> > + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> > + goto inc_and_return;
> > +
> > + if (ac->ac_groups_linear_remaining) {
> > + ac->ac_groups_linear_remaining--;
> > + goto inc_and_return;
> > + }
> > +
> > + if (*new_cr == 0) {
> > + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> > + if (ret)
> > + goto inc_and_return;
> > + }
> > + if (*new_cr == 1) {
> > + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> > + if (ret)
> > + goto inc_and_return;
> > + }
> > + return;
> > +
> > +inc_and_return:
> > + /*
> > + * Artificially restricted ngroups for non-extent
> > + * files makes group > ngroups possible on first loop.
> > + */
> > + *group = *group + 1;
> > + if (*group >= ngroups)
> > + *group = 0;
> > +}
> > +
> > /*
> > * Cache the order of the largest free extent we have available in this block
> > * group.
> > @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> > static void
> > mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> > {
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > int i;
> > - int bits;
> >
> > + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> > + write_lock(&sbi->s_mb_largest_free_orders_locks[
> > + grp->bb_largest_free_order]);
> > + list_del_init(&grp->bb_largest_free_order_node);
> > + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> > + grp->bb_largest_free_order]);
> > + }
> > grp->bb_largest_free_order = -1; /* uninit */
> >
> > - bits = MB_NUM_ORDERS(sb) - 1;
> > - for (i = bits; i >= 0; i--) {
> > + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> > if (grp->bb_counters[i] > 0) {
> > grp->bb_largest_free_order = i;
> > break;
> > }
> > }
> > + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
> > + grp->bb_largest_free_order >= 0 && grp->bb_free) {
> > + write_lock(&sbi->s_mb_largest_free_orders_locks[
> > + grp->bb_largest_free_order]);
> > + list_add_tail(&grp->bb_largest_free_order_node,
> > + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> > + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> > + grp->bb_largest_free_order]);
> > + }
> > }
> >
> > static noinline_for_stack
> > @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
> > period = get_cycles() - period;
> > atomic_inc(&sbi->s_mb_buddies_generated);
> > atomic64_add(period, &sbi->s_mb_generation_time);
> > + mb_update_avg_fragment_size(sb, grp);
> > }
> >
> > /* The buddy information is attached the buddy cache inode
> > @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> >
> > done:
> > mb_set_largest_free_order(sb, e4b->bd_info);
> > + mb_update_avg_fragment_size(sb, e4b->bd_info);
> > mb_check_buddy(e4b);
> > }
> >
> > @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> > }
> > mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
> >
> > + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
> > ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
> > mb_check_buddy(e4b);
> >
> > @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> > * from the goal value specified
> > */
> > group = ac->ac_g_ex.fe_group;
> > + ac->ac_last_optimal_group = group;
> > + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
> > prefetch_grp = group;
> >
> > - for (i = 0; i < ngroups; group++, i++) {
> > - int ret = 0;
> > + for (i = 0; i < ngroups; i++) {
> > + int ret = 0, new_cr;
> > +
> > cond_resched();
> > - /*
> > - * Artificially restricted ngroups for non-extent
> > - * files makes group > ngroups possible on first loop.
> > - */
> > - if (group >= ngroups)
> > - group = 0;
> > +
> > + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
> > +
> > + if (new_cr != cr) {
> > + cr = new_cr;
> > + goto repeat;
> > + }
> >
> > /*
> > * Batch reads of the block allocation bitmaps
> > @@ -2578,6 +2902,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> > atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
> > seq_printf(seq, "\t\tuseless_loops: %llu\n",
> > atomic64_read(&sbi->s_bal_cX_failed[0]));
> > + seq_printf(seq, "\t\tbad_suggestions: %u\n",
> > + atomic_read(&sbi->s_bal_cr0_bad_suggestions));
> >
> > seq_puts(seq, "\tcr1_stats:\n");
> > seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
> > @@ -2585,6 +2911,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> > atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
> > seq_printf(seq, "\t\tuseless_loops: %llu\n",
> > atomic64_read(&sbi->s_bal_cX_failed[1]));
> > + seq_printf(seq, "\t\tbad_suggestions: %u\n",
> > + atomic_read(&sbi->s_bal_cr1_bad_suggestions));
> >
> > seq_puts(seq, "\tcr2_stats:\n");
> > seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
> > @@ -2719,7 +3047,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> > INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> > init_rwsem(&meta_group_info[i]->alloc_sem);
> > meta_group_info[i]->bb_free_root = RB_ROOT;
> > + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> > + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
> > meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
> > + meta_group_info[i]->bb_group = group;
> >
> > mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
> > return 0;
> > @@ -2909,6 +3240,26 @@ int ext4_mb_init(struct super_block *sb)
> > i++;
> > } while (i < MB_NUM_ORDERS(sb));
> >
> > + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
> > + sbi->s_mb_largest_free_orders =
> > + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> > + GFP_KERNEL);
> > + if (!sbi->s_mb_largest_free_orders) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + sbi->s_mb_largest_free_orders_locks =
> > + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> > + GFP_KERNEL);
> > + if (!sbi->s_mb_largest_free_orders_locks) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> > + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> > + rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
> > + }
> > + rwlock_init(&sbi->s_mb_rb_lock);
> >
> > spin_lock_init(&sbi->s_md_lock);
> > sbi->s_mb_free_pending = 0;
> > @@ -2961,6 +3312,10 @@ int ext4_mb_init(struct super_block *sb)
> > spin_lock_init(&lg->lg_prealloc_lock);
> > }
> >
> > + if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
> > + sbi->s_mb_linear_limit = 0;
> > + else
> > + sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
> > /* init file for buddy data */
> > ret = ext4_mb_init_backend(sb);
> > if (ret != 0)
> > @@ -2972,6 +3327,8 @@ int ext4_mb_init(struct super_block *sb)
> > free_percpu(sbi->s_locality_groups);
> > sbi->s_locality_groups = NULL;
> > out:
> > + kfree(sbi->s_mb_largest_free_orders);
> > + kfree(sbi->s_mb_largest_free_orders_locks);
> > kfree(sbi->s_mb_offsets);
> > sbi->s_mb_offsets = NULL;
> > kfree(sbi->s_mb_maxs);
> > @@ -3028,6 +3385,8 @@ int ext4_mb_release(struct super_block *sb)
> > kvfree(group_info);
> > rcu_read_unlock();
> > }
> > + kfree(sbi->s_mb_largest_free_orders);
> > + kfree(sbi->s_mb_largest_free_orders_locks);
> > kfree(sbi->s_mb_offsets);
> > kfree(sbi->s_mb_maxs);
> > iput(sbi->s_buddy_cache);
> > diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> > index 68111a10cfee..02585e3cbcad 100644
> > --- a/fs/ext4/mballoc.h
> > +++ b/fs/ext4/mballoc.h
> > @@ -78,6 +78,18 @@
> > */
> > #define MB_DEFAULT_MAX_INODE_PREALLOC 512
> >
> > +/*
> > + * Number of groups to search linearly before performing group scanning
> > + * optimization.
> > + */
> > +#define MB_DEFAULT_LINEAR_LIMIT 4
> > +
> > +/*
> > + * Minimum number of groups that should be present in the file system to perform
> > + * group scanning optimizations.
> > + */
> > +#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
> > +
> > /*
> > * Number of valid buddy orders
> > */
> > @@ -166,11 +178,14 @@ struct ext4_allocation_context {
> > /* copy of the best found extent taken before preallocation efforts */
> > struct ext4_free_extent ac_f_ex;
> >
> > + ext4_group_t ac_last_optimal_group;
> > + __u32 ac_groups_considered;
> > + __u32 ac_flags; /* allocation hints */
> > __u16 ac_groups_scanned;
> > + __u16 ac_groups_linear_remaining;
> > __u16 ac_found;
> > __u16 ac_tail;
> > __u16 ac_buddy;
> > - __u16 ac_flags; /* allocation hints */
> > __u8 ac_status;
> > __u8 ac_criteria;
> > __u8 ac_2order; /* if request is to allocate 2^N blocks and
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 3a2cd9fb7e73..6116640081c0 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -1687,7 +1687,7 @@ enum {
> > Opt_dioread_nolock, Opt_dioread_lock,
> > Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> > Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> > - Opt_prefetch_block_bitmaps,
> > + Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
> > #ifdef CONFIG_EXT4_DEBUG
> > Opt_fc_debug_max_replay, Opt_fc_debug_force
> > #endif
> > @@ -1788,6 +1788,7 @@ static const match_table_t tokens = {
> > {Opt_nombcache, "nombcache"},
> > {Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
> > {Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
> > + {Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
> > {Opt_removed, "check=none"}, /* mount option from ext2/3 */
> > {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
> > {Opt_removed, "reservation"}, /* mount option from ext2/3 */
> > @@ -1820,6 +1821,8 @@ static ext4_fsblk_t get_sb_block(void **data)
> > }
> >
> > #define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
> > +#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
> > +
> > static const char deprecated_msg[] =
> > "Mount option \"%s\" will be removed by %s\n"
> > "Contact [email protected] if you think we should keep it.\n";
> > @@ -2008,6 +2011,7 @@ static const struct mount_opts {
> > {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
> > {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
> > MOPT_SET},
> > + {Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
> > #ifdef CONFIG_EXT4_DEBUG
> > {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
> > MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
> > @@ -2092,6 +2096,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
> > struct ext4_parsed_options {
> > unsigned long journal_devnum;
> > unsigned int journal_ioprio;
> > + int mb_optimize_scan;
> > };
> >
> > static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> > @@ -2388,6 +2393,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> > sbi->s_mount_opt |= m->mount_opt;
> > } else if (token == Opt_data_err_ignore) {
> > sbi->s_mount_opt &= ~m->mount_opt;
> > + } else if (token == Opt_mb_optimize_scan) {
> > + if (arg != 0 && arg != 1) {
> > + ext4_msg(sb, KERN_WARNING,
> > + "mb_optimize_scan should be set to 0 or 1.");
> > + return -1;
> > + }
> > + parsed_opts->mb_optimize_scan = arg;
> > } else {
> > if (!args->from)
> > arg = 1;
> > @@ -4034,6 +4046,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > /* Set defaults for the variables that will be set during parsing */
> > parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> > parsed_opts.journal_devnum = 0;
> > + parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
> >
> > if ((data && !orig_data) || !sbi)
> > goto out_free_base;
> > @@ -4984,6 +4997,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > ext4_fc_replay_cleanup(sb);
> >
> > ext4_ext_init(sb);
> > +
> > + /*
> > + * Enable optimize_scan if number of groups is > threshold. This can be
> > + * turned off by passing "mb_optimize_scan=0". This can also be
> > + * turned on forcefully by passing "mb_optimize_scan=1".
> > + */
> > + if (parsed_opts.mb_optimize_scan == 1)
> > + set_opt2(sb, MB_OPTIMIZE_SCAN);
> > + else if (parsed_opts.mb_optimize_scan == 0)
> > + clear_opt2(sb, MB_OPTIMIZE_SCAN);
> > + else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
> > + set_opt2(sb, MB_OPTIMIZE_SCAN);
> > +
> > err = ext4_mb_init(sb);
> > if (err) {
> > ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
> > diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> > index 59ca9d73b42f..16b8a838f631 100644
> > --- a/fs/ext4/sysfs.c
> > +++ b/fs/ext4/sysfs.c
> > @@ -213,6 +213,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> > EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
> > EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> > EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
> > +EXT4_RW_ATTR_SBI_UI(mb_linear_limit, s_mb_linear_limit);
> > EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
> > EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
> > EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
> > @@ -260,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
> > ATTR_LIST(mb_stream_req),
> > ATTR_LIST(mb_group_prealloc),
> > ATTR_LIST(mb_max_inode_prealloc),
> > + ATTR_LIST(mb_linear_limit),
> > ATTR_LIST(max_writeback_mb_bump),
> > ATTR_LIST(extent_max_zeroout_kb),
> > ATTR_LIST(trigger_fs_error),
> > --
> > 2.31.0.291.g576ba9dcdaf-goog
> >
>
>
> Cheers, Andreas
>
>
>
>
>
Another problem that I found in the patch was that when all groups are
not loaded, the optimization concludes that CR0 and CR1 cannot be used
for allocation (since these structures aren't populated). The problem
goes away if we also mount with prefetch_block_bitmaps mount option.
But I think if all the groups are not loaded, we should at least fall
back to linear search at CR1. Maybe skipping CR0 is okay..
On Mon, Mar 29, 2021 at 12:32 PM harshad shirwadkar
<[email protected]> wrote:
>
> Okay sounds good. I'll do that!
>
> On Mon, Mar 29, 2021 at 12:14 PM Andreas Dilger <[email protected]> wrote:
> >
> > On Mar 24, 2021, at 5:19 PM, Harshad Shirwadkar <[email protected]> wrote:
> > >
> > > Instead of traversing through groups linearly, scan groups in specific
> > > orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> > > largest free order >= the order of the request. So, with this patch,
> > > we maintain lists for each possible order and insert each group into a
> > > list based on the largest free order in its buddy bitmap. During cr 0
> > > allocation, we traverse these lists in the increasing order of largest
> > > free orders. This allows us to find a group with the best available cr
> > > 0 match in constant time. If nothing can be found, we fallback to cr 1
> > > immediately.
> > >
> > > At CR1, the story is slightly different. We want to traverse in the
> > > order of increasing average fragment size. For CR1, we maintain a rb
> > > tree of groupinfos which is sorted by average fragment size. Instead
> > > of traversing linearly, at CR1, we traverse in the order of increasing
> > > average fragment size, starting at the most optimal group. This brings
> > > down cr 1 search complexity to log(num groups).
> > >
> > > For cr >= 2, we just perform the linear search as before. Also, in
> > > case of lock contention, we intermittently fallback to linear search
> > > even in CR 0 and CR 1 cases. This allows us to proceed during the
> > > allocation path even in case of high contention.
> >
> > If you are refreshing this patch anyway, could you rename "mb_linear_limit"
> > to "mb_linear_groups" or "mb_max_linear_groups" or similar? It otherwise
> > isn't clear what units the "limit" is in (could be MB/GB or something else).
> >
> > Cheers, Andreas
> >
> > > There is an opportunity to do optimization at CR2 too. That's because
> > > at CR2 we only consider groups where bb_free counter (number of free
> > > blocks) is greater than the request extent size. That's left as future
> > > work.
> > >
> > > All the changes introduced in this patch are protected under a new
> > > mount option "mb_optimize_scan".
> > >
> > > With this patchset, following experiment was performed:
> > >
> > > Created a highly fragmented disk of size 65TB. The disk had no
> > > contiguous 2M regions. Following command was run consecutively for 3
> > > times:
> > >
> > > time dd if=/dev/urandom of=file bs=2M count=10
> > >
> > > Here are the results with and without cr 0/1 optimizations introduced
> > > in this patch:
> > >
> > > |---------+------------------------------+---------------------------|
> > > | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> > > |---------+------------------------------+---------------------------|
> > > | 1st run | 5m1.871s | 2m47.642s |
> > > | 2nd run | 2m28.390s | 0m0.611s |
> > > | 3rd run | 2m26.530s | 0m1.255s |
> > > |---------+------------------------------+---------------------------|
> > >
> > > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > > Reported-by: kernel test robot <[email protected]>
> > > Reported-by: Dan Carpenter <[email protected]>
> > > Reviewed-by: Andreas Dilger <[email protected]>
> > > ---
> > > fs/ext4/ext4.h | 19 ++-
> > > fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
> > > fs/ext4/mballoc.h | 17 ++-
> > > fs/ext4/super.c | 28 +++-
> > > fs/ext4/sysfs.c | 2 +
> > > 5 files changed, 432 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 85eeeba3bca3..5930c8cb5159 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
> > > #define EXT4_MB_USE_RESERVED 0x2000
> > > /* Do strict check for free blocks while retrying block allocation */
> > > #define EXT4_MB_STRICT_CHECK 0x4000
> > > -
> > > +/* Large fragment size list lookup succeeded at least once for cr = 0 */
> > > +#define EXT4_MB_CR0_OPTIMIZED 0x8000
> > > +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
> > > +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
> > > struct ext4_allocation_request {
> > > /* target inode for block we're allocating */
> > > struct inode *inode;
> > > @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
> > > #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
> > > #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
> > > #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
> > > -
> > > +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
> > > + * scanning in mballoc
> > > + */
> > >
> > > #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> > > ~EXT4_MOUNT_##opt
> > > @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
> > > unsigned int s_mb_free_pending;
> > > struct list_head s_freed_data_list; /* List of blocks to be freed
> > > after commit completed */
> > > + struct rb_root s_mb_avg_fragment_size_root;
> > > + rwlock_t s_mb_rb_lock;
> > > + struct list_head *s_mb_largest_free_orders;
> > > + rwlock_t *s_mb_largest_free_orders_locks;
> > >
> > > /* tunables */
> > > unsigned long s_stripe;
> > > + unsigned int s_mb_linear_limit;
> > > unsigned int s_mb_stream_request;
> > > unsigned int s_mb_max_to_scan;
> > > unsigned int s_mb_min_to_scan;
> > > @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
> > > atomic_t s_bal_goals; /* goal hits */
> > > atomic_t s_bal_breaks; /* too long searches */
> > > atomic_t s_bal_2orders; /* 2^order hits */
> > > + atomic_t s_bal_cr0_bad_suggestions;
> > > + atomic_t s_bal_cr1_bad_suggestions;
> > > atomic64_t s_bal_cX_groups_considered[4];
> > > atomic64_t s_bal_cX_hits[4];
> > > atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
> > > @@ -3309,11 +3321,14 @@ struct ext4_group_info {
> > > ext4_grpblk_t bb_free; /* total free blocks */
> > > ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> > > ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
> > > + ext4_group_t bb_group; /* Group number */
> > > struct list_head bb_prealloc_list;
> > > #ifdef DOUBLE_CHECK
> > > void *bb_bitmap;
> > > #endif
> > > struct rw_semaphore alloc_sem;
> > > + struct rb_node bb_avg_fragment_size_rb;
> > > + struct list_head bb_largest_free_order_node;
> > > ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
> > > * regions, index is order.
> > > * bb_counters[3] = 5 means
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 15127d815461..cbf9a89c0ef5 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -127,11 +127,50 @@
> > > * smallest multiple of the stripe value (sbi->s_stripe) which is
> > > * greater than the default mb_group_prealloc.
> > > *
> > > + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
> > > + * structures in two data structures:
> > > + *
> > > + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
> > > + *
> > > + * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
> > > + *
> > > + * This is an array of lists where the index in the array represents the
> > > + * largest free order in the buddy bitmap of the participating group infos of
> > > + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
> > > + * number of buddy bitmap orders possible) number of lists. Group-infos are
> > > + * placed in appropriate lists.
> > > + *
> > > + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
> > > + *
> > > + * Locking: sbi->s_mb_rb_lock (rwlock)
> > > + *
> > > + * This is a red black tree consisting of group infos and the tree is sorted
> > > + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
> > > + * / ext4_group_info->bb_fragments).
> > > + *
> > > + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
> > > + * structures to decide the order in which groups are to be traversed for
> > > + * fulfilling an allocation request.
> > > + *
> > > + * At CR = 0, we look for groups which have the largest_free_order >= the order
> > > + * of the request. We directly look at the largest free order list in the data
> > > + * structure (1) above where largest_free_order = order of the request. If that
> > > + * list is empty, we look at remaining list in the increasing order of
> > > + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
> > > + *
> > > + * At CR = 1, we only consider groups where average fragment size > request
> > > + * size. So, we lookup a group which has average fragment size just above or
> > > + * equal to request size using our rb tree (data structure 2) in O(log N) time.
> > > + *
> > > + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
> > > + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
> > > + *
> > > * The regular allocator (using the buddy cache) supports a few tunables.
> > > *
> > > * /sys/fs/ext4/<partition>/mb_min_to_scan
> > > * /sys/fs/ext4/<partition>/mb_max_to_scan
> > > * /sys/fs/ext4/<partition>/mb_order2_req
> > > + * /sys/fs/ext4/<partition>/mb_linear_limit
> > > *
> > > * The regular allocator uses buddy scan only if the request len is power of
> > > * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
> > > @@ -149,6 +188,16 @@
> > > * can be used for allocation. ext4_mb_good_group explains how the groups are
> > > * checked.
> > > *
> > > + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
> > > + * get traversed linearly. That may result in subsequent allocations being not
> > > + * close to each other. And so, the underlying device may get filled up in a
> > > + * non-linear fashion. While that may not matter on non-rotational devices, for
> > > + * rotational devices that may result in higher seek times. "mb_linear_limit"
> > > + * tells mballoc how many groups mballoc should search linearly before
> > > + * performing consulting above data structures for more efficient lookups. For
> > > + * non rotational devices, this value defaults to 0 and for rotational devices
> > > + * this is set to MB_DEFAULT_LINEAR_LIMIT.
> > > + *
> > > * Both the prealloc space are getting populated as above. So for the first
> > > * request we will hit the buddy cache which will result in this prealloc
> > > * space getting filled. The prealloc space is then later used for the
> > > @@ -299,6 +348,8 @@
> > > * - bitlock on a group (group)
> > > * - object (inode/locality) (object)
> > > * - per-pa lock (pa)
> > > + * - cr0 lists lock (cr0)
> > > + * - cr1 tree lock (cr1)
> > > *
> > > * Paths:
> > > * - new pa
> > > @@ -328,6 +379,9 @@
> > > * group
> > > * object
> > > *
> > > + * - allocation path (ext4_mb_regular_allocator)
> > > + * group
> > > + * cr0/cr1
> > > */
> > > static struct kmem_cache *ext4_pspace_cachep;
> > > static struct kmem_cache *ext4_ac_cachep;
> > > @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
> > > ext4_group_t group);
> > > static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
> > >
> > > +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
> > > + ext4_group_t group, int cr);
> > > +
> > > /*
> > > * The algorithm using this percpu seq counter goes below:
> > > * 1. We sample the percpu discard_pa_seq counter before trying for block
> > > @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> > > }
> > > }
> > >
> > > +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> > > + int (*cmp)(struct rb_node *, struct rb_node *))
> > > +{
> > > + struct rb_node **iter = &root->rb_node, *parent = NULL;
> > > +
> > > + while (*iter) {
> > > + parent = *iter;
> > > + if (cmp(new, *iter) > 0)
> > > + iter = &((*iter)->rb_left);
> > > + else
> > > + iter = &((*iter)->rb_right);
> > > + }
> > > +
> > > + rb_link_node(new, parent, iter);
> > > + rb_insert_color(new, root);
> > > +}
> > > +
> > > +static int
> > > +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> > > +{
> > > + struct ext4_group_info *grp1 = rb_entry(rb1,
> > > + struct ext4_group_info,
> > > + bb_avg_fragment_size_rb);
> > > + struct ext4_group_info *grp2 = rb_entry(rb2,
> > > + struct ext4_group_info,
> > > + bb_avg_fragment_size_rb);
> > > + int num_frags_1, num_frags_2;
> > > +
> > > + num_frags_1 = grp1->bb_fragments ?
> > > + grp1->bb_free / grp1->bb_fragments : 0;
> > > + num_frags_2 = grp2->bb_fragments ?
> > > + grp2->bb_free / grp2->bb_fragments : 0;
> > > +
> > > + return (num_frags_2 - num_frags_1);
> > > +}
> > > +
> > > +/*
> > > + * Reinsert grpinfo into the avg_fragment_size tree with new average
> > > + * fragment size.
> > > + */
> > > +static void
> > > +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> > > +{
> > > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > +
> > > + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> > > + return;
> > > +
> > > + write_lock(&sbi->s_mb_rb_lock);
> > > + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> > > + rb_erase(&grp->bb_avg_fragment_size_rb,
> > > + &sbi->s_mb_avg_fragment_size_root);
> > > + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> > > + }
> > > +
> > > + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
> > > + &grp->bb_avg_fragment_size_rb,
> > > + ext4_mb_avg_fragment_size_cmp);
> > > + write_unlock(&sbi->s_mb_rb_lock);
> > > +}
> > > +
> > > +/*
> > > + * Choose next group by traversing largest_free_order lists. Return 0 if next
> > > + * group was selected optimally. Return 1 if next group was not selected
> > > + * optimally. Updates *new_cr if cr level needs an update.
> > > + */
> > > +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> > > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > > +{
> > > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > > + struct ext4_group_info *iter, *grp;
> > > + int i;
> > > +
> > > + if (ac->ac_status == AC_STATUS_FOUND)
> > > + return 1;
> > > +
> > > + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> > > + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> > > +
> > > + grp = NULL;
> > > + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> > > + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> > > + continue;
> > > + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> > > + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> > > + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> > > + continue;
> > > + }
> > > + grp = NULL;
> > > + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> > > + bb_largest_free_order_node) {
> > > + if (sbi->s_mb_stats)
> > > + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> > > + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> > > + grp = iter;
> > > + break;
> > > + }
> > > + }
> > > + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> > > + if (grp)
> > > + break;
> > > + }
> > > +
> > > + if (!grp) {
> > > + /* Increment cr and search again */
> > > + *new_cr = 1;
> > > + } else {
> > > + *group = grp->bb_group;
> > > + ac->ac_last_optimal_group = *group;
> > > + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * Choose next group by traversing average fragment size tree. Return 0 if next
> > > + * group was selected optimally. Return 1 if next group could not selected
> > > + * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
> > > + * update.
> > > + */
> > > +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> > > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > > +{
> > > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > > + int avg_fragment_size, best_so_far;
> > > + struct rb_node *node, *found;
> > > + struct ext4_group_info *grp;
> > > +
> > > + /*
> > > + * If there is contention on the lock, instead of waiting for the lock
> > > + * to become available, just continue searching lineraly. We'll resume
> > > + * our rb tree search later starting at ac->ac_last_optimal_group.
> > > + */
> > > + if (!read_trylock(&sbi->s_mb_rb_lock))
> > > + return 1;
> > > +
> > > + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> > > + if (sbi->s_mb_stats)
> > > + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> > > + /* We have found something at CR 1 in the past */
> > > + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
> > > + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
> > > + found = rb_next(found)) {
> > > + grp = rb_entry(found, struct ext4_group_info,
> > > + bb_avg_fragment_size_rb);
> > > + if (sbi->s_mb_stats)
> > > + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> > > + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
> > > + break;
> > > + }
> > > +
> > > + goto done;
> > > + }
> > > +
> > > + node = sbi->s_mb_avg_fragment_size_root.rb_node;
> > > + best_so_far = 0;
> > > + found = NULL;
> > > +
> > > + while (node) {
> > > + grp = rb_entry(node, struct ext4_group_info,
> > > + bb_avg_fragment_size_rb);
> > > + avg_fragment_size = 0;
> > > + /*
> > > + * Perform this check without locking, we'll lock later to confirm.
> > > + */
> > > + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
> > > + avg_fragment_size = grp->bb_fragments ?
> > > + grp->bb_free / grp->bb_fragments : 0;
> > > + if (!best_so_far || avg_fragment_size < best_so_far) {
> > > + best_so_far = avg_fragment_size;
> > > + found = node;
> > > + }
> > > + }
> > > + if (avg_fragment_size > ac->ac_g_ex.fe_len)
> > > + node = node->rb_right;
> > > + else
> > > + node = node->rb_left;
> > > + }
> > > +
> > > +done:
> > > + if (found) {
> > > + grp = rb_entry(found, struct ext4_group_info,
> > > + bb_avg_fragment_size_rb);
> > > + *group = grp->bb_group;
> > > + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> > > + } else {
> > > + *new_cr = 2;
> > > + }
> > > +
> > > + read_unlock(&sbi->s_mb_rb_lock);
> > > + ac->ac_last_optimal_group = *group;
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * ext4_mb_choose_next_group: choose next group for allocation.
> > > + *
> > > + * @ac Allocation Context
> > > + * @new_cr This is an output parameter. If the there is no good group
> > > + * available at current CR level, this field is updated to indicate
> > > + * the new cr level that should be used.
> > > + * @group This is an input / output parameter. As an input it indicates the
> > > + * last group used for allocation. As output, this field indicates
> > > + * the next group that should be used.
> > > + * @ngroups Total number of groups
> > > + */
> > > +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> > > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > > +{
> > > + int ret;
> > > +
> > > + *new_cr = ac->ac_criteria;
> > > +
> > > + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> > > + *new_cr >= 2 ||
> > > + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> > > + goto inc_and_return;
> > > +
> > > + if (ac->ac_groups_linear_remaining) {
> > > + ac->ac_groups_linear_remaining--;
> > > + goto inc_and_return;
> > > + }
> > > +
> > > + if (*new_cr == 0) {
> > > + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> > > + if (ret)
> > > + goto inc_and_return;
> > > + }
> > > + if (*new_cr == 1) {
> > > + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> > > + if (ret)
> > > + goto inc_and_return;
> > > + }
> > > + return;
> > > +
> > > +inc_and_return:
> > > + /*
> > > + * Artificially restricted ngroups for non-extent
> > > + * files makes group > ngroups possible on first loop.
> > > + */
> > > + *group = *group + 1;
> > > + if (*group >= ngroups)
> > > + *group = 0;
> > > +}
> > > +
> > > /*
> > > * Cache the order of the largest free extent we have available in this block
> > > * group.
> > > @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> > > static void
> > > mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> > > {
> > > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > int i;
> > > - int bits;
> > >
> > > + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> > > + write_lock(&sbi->s_mb_largest_free_orders_locks[
> > > + grp->bb_largest_free_order]);
> > > + list_del_init(&grp->bb_largest_free_order_node);
> > > + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> > > + grp->bb_largest_free_order]);
> > > + }
> > > grp->bb_largest_free_order = -1; /* uninit */
> > >
> > > - bits = MB_NUM_ORDERS(sb) - 1;
> > > - for (i = bits; i >= 0; i--) {
> > > + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> > > if (grp->bb_counters[i] > 0) {
> > > grp->bb_largest_free_order = i;
> > > break;
> > > }
> > > }
> > > + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
> > > + grp->bb_largest_free_order >= 0 && grp->bb_free) {
> > > + write_lock(&sbi->s_mb_largest_free_orders_locks[
> > > + grp->bb_largest_free_order]);
> > > + list_add_tail(&grp->bb_largest_free_order_node,
> > > + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> > > + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> > > + grp->bb_largest_free_order]);
> > > + }
> > > }
> > >
> > > static noinline_for_stack
> > > @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
> > > period = get_cycles() - period;
> > > atomic_inc(&sbi->s_mb_buddies_generated);
> > > atomic64_add(period, &sbi->s_mb_generation_time);
> > > + mb_update_avg_fragment_size(sb, grp);
> > > }
> > >
> > > /* The buddy information is attached the buddy cache inode
> > > @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> > >
> > > done:
> > > mb_set_largest_free_order(sb, e4b->bd_info);
> > > + mb_update_avg_fragment_size(sb, e4b->bd_info);
> > > mb_check_buddy(e4b);
> > > }
> > >
> > > @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> > > }
> > > mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
> > >
> > > + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
> > > ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
> > > mb_check_buddy(e4b);
> > >
> > > @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> > > * from the goal value specified
> > > */
> > > group = ac->ac_g_ex.fe_group;
> > > + ac->ac_last_optimal_group = group;
> > > + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
> > > prefetch_grp = group;
> > >
> > > - for (i = 0; i < ngroups; group++, i++) {
> > > - int ret = 0;
> > > + for (i = 0; i < ngroups; i++) {
> > > + int ret = 0, new_cr;
> > > +
> > > cond_resched();
> > > - /*
> > > - * Artificially restricted ngroups for non-extent
> > > - * files makes group > ngroups possible on first loop.
> > > - */
> > > - if (group >= ngroups)
> > > - group = 0;
> > > +
> > > + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
> > > +
> > > + if (new_cr != cr) {
> > > + cr = new_cr;
> > > + goto repeat;
> > > + }
> > >
> > > /*
> > > * Batch reads of the block allocation bitmaps
> > > @@ -2578,6 +2902,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> > > atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
> > > seq_printf(seq, "\t\tuseless_loops: %llu\n",
> > > atomic64_read(&sbi->s_bal_cX_failed[0]));
> > > + seq_printf(seq, "\t\tbad_suggestions: %u\n",
> > > + atomic_read(&sbi->s_bal_cr0_bad_suggestions));
> > >
> > > seq_puts(seq, "\tcr1_stats:\n");
> > > seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
> > > @@ -2585,6 +2911,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> > > atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
> > > seq_printf(seq, "\t\tuseless_loops: %llu\n",
> > > atomic64_read(&sbi->s_bal_cX_failed[1]));
> > > + seq_printf(seq, "\t\tbad_suggestions: %u\n",
> > > + atomic_read(&sbi->s_bal_cr1_bad_suggestions));
> > >
> > > seq_puts(seq, "\tcr2_stats:\n");
> > > seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
> > > @@ -2719,7 +3047,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> > > INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> > > init_rwsem(&meta_group_info[i]->alloc_sem);
> > > meta_group_info[i]->bb_free_root = RB_ROOT;
> > > + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> > > + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
> > > meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
> > > + meta_group_info[i]->bb_group = group;
> > >
> > > mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
> > > return 0;
> > > @@ -2909,6 +3240,26 @@ int ext4_mb_init(struct super_block *sb)
> > > i++;
> > > } while (i < MB_NUM_ORDERS(sb));
> > >
> > > + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
> > > + sbi->s_mb_largest_free_orders =
> > > + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> > > + GFP_KERNEL);
> > > + if (!sbi->s_mb_largest_free_orders) {
> > > + ret = -ENOMEM;
> > > + goto out;
> > > + }
> > > + sbi->s_mb_largest_free_orders_locks =
> > > + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> > > + GFP_KERNEL);
> > > + if (!sbi->s_mb_largest_free_orders_locks) {
> > > + ret = -ENOMEM;
> > > + goto out;
> > > + }
> > > + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> > > + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> > > + rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
> > > + }
> > > + rwlock_init(&sbi->s_mb_rb_lock);
> > >
> > > spin_lock_init(&sbi->s_md_lock);
> > > sbi->s_mb_free_pending = 0;
> > > @@ -2961,6 +3312,10 @@ int ext4_mb_init(struct super_block *sb)
> > > spin_lock_init(&lg->lg_prealloc_lock);
> > > }
> > >
> > > + if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
> > > + sbi->s_mb_linear_limit = 0;
> > > + else
> > > + sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
> > > /* init file for buddy data */
> > > ret = ext4_mb_init_backend(sb);
> > > if (ret != 0)
> > > @@ -2972,6 +3327,8 @@ int ext4_mb_init(struct super_block *sb)
> > > free_percpu(sbi->s_locality_groups);
> > > sbi->s_locality_groups = NULL;
> > > out:
> > > + kfree(sbi->s_mb_largest_free_orders);
> > > + kfree(sbi->s_mb_largest_free_orders_locks);
> > > kfree(sbi->s_mb_offsets);
> > > sbi->s_mb_offsets = NULL;
> > > kfree(sbi->s_mb_maxs);
> > > @@ -3028,6 +3385,8 @@ int ext4_mb_release(struct super_block *sb)
> > > kvfree(group_info);
> > > rcu_read_unlock();
> > > }
> > > + kfree(sbi->s_mb_largest_free_orders);
> > > + kfree(sbi->s_mb_largest_free_orders_locks);
> > > kfree(sbi->s_mb_offsets);
> > > kfree(sbi->s_mb_maxs);
> > > iput(sbi->s_buddy_cache);
> > > diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> > > index 68111a10cfee..02585e3cbcad 100644
> > > --- a/fs/ext4/mballoc.h
> > > +++ b/fs/ext4/mballoc.h
> > > @@ -78,6 +78,18 @@
> > > */
> > > #define MB_DEFAULT_MAX_INODE_PREALLOC 512
> > >
> > > +/*
> > > + * Number of groups to search linearly before performing group scanning
> > > + * optimization.
> > > + */
> > > +#define MB_DEFAULT_LINEAR_LIMIT 4
> > > +
> > > +/*
> > > + * Minimum number of groups that should be present in the file system to perform
> > > + * group scanning optimizations.
> > > + */
> > > +#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
> > > +
> > > /*
> > > * Number of valid buddy orders
> > > */
> > > @@ -166,11 +178,14 @@ struct ext4_allocation_context {
> > > /* copy of the best found extent taken before preallocation efforts */
> > > struct ext4_free_extent ac_f_ex;
> > >
> > > + ext4_group_t ac_last_optimal_group;
> > > + __u32 ac_groups_considered;
> > > + __u32 ac_flags; /* allocation hints */
> > > __u16 ac_groups_scanned;
> > > + __u16 ac_groups_linear_remaining;
> > > __u16 ac_found;
> > > __u16 ac_tail;
> > > __u16 ac_buddy;
> > > - __u16 ac_flags; /* allocation hints */
> > > __u8 ac_status;
> > > __u8 ac_criteria;
> > > __u8 ac_2order; /* if request is to allocate 2^N blocks and
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 3a2cd9fb7e73..6116640081c0 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -1687,7 +1687,7 @@ enum {
> > > Opt_dioread_nolock, Opt_dioread_lock,
> > > Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> > > Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> > > - Opt_prefetch_block_bitmaps,
> > > + Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
> > > #ifdef CONFIG_EXT4_DEBUG
> > > Opt_fc_debug_max_replay, Opt_fc_debug_force
> > > #endif
> > > @@ -1788,6 +1788,7 @@ static const match_table_t tokens = {
> > > {Opt_nombcache, "nombcache"},
> > > {Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
> > > {Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
> > > + {Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
> > > {Opt_removed, "check=none"}, /* mount option from ext2/3 */
> > > {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
> > > {Opt_removed, "reservation"}, /* mount option from ext2/3 */
> > > @@ -1820,6 +1821,8 @@ static ext4_fsblk_t get_sb_block(void **data)
> > > }
> > >
> > > #define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
> > > +#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
> > > +
> > > static const char deprecated_msg[] =
> > > "Mount option \"%s\" will be removed by %s\n"
> > > "Contact [email protected] if you think we should keep it.\n";
> > > @@ -2008,6 +2011,7 @@ static const struct mount_opts {
> > > {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
> > > {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
> > > MOPT_SET},
> > > + {Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
> > > #ifdef CONFIG_EXT4_DEBUG
> > > {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
> > > MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
> > > @@ -2092,6 +2096,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
> > > struct ext4_parsed_options {
> > > unsigned long journal_devnum;
> > > unsigned int journal_ioprio;
> > > + int mb_optimize_scan;
> > > };
> > >
> > > static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> > > @@ -2388,6 +2393,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> > > sbi->s_mount_opt |= m->mount_opt;
> > > } else if (token == Opt_data_err_ignore) {
> > > sbi->s_mount_opt &= ~m->mount_opt;
> > > + } else if (token == Opt_mb_optimize_scan) {
> > > + if (arg != 0 && arg != 1) {
> > > + ext4_msg(sb, KERN_WARNING,
> > > + "mb_optimize_scan should be set to 0 or 1.");
> > > + return -1;
> > > + }
> > > + parsed_opts->mb_optimize_scan = arg;
> > > } else {
> > > if (!args->from)
> > > arg = 1;
> > > @@ -4034,6 +4046,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > > /* Set defaults for the variables that will be set during parsing */
> > > parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> > > parsed_opts.journal_devnum = 0;
> > > + parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
> > >
> > > if ((data && !orig_data) || !sbi)
> > > goto out_free_base;
> > > @@ -4984,6 +4997,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > > ext4_fc_replay_cleanup(sb);
> > >
> > > ext4_ext_init(sb);
> > > +
> > > + /*
> > > + * Enable optimize_scan if number of groups is > threshold. This can be
> > > + * turned off by passing "mb_optimize_scan=0". This can also be
> > > + * turned on forcefully by passing "mb_optimize_scan=1".
> > > + */
> > > + if (parsed_opts.mb_optimize_scan == 1)
> > > + set_opt2(sb, MB_OPTIMIZE_SCAN);
> > > + else if (parsed_opts.mb_optimize_scan == 0)
> > > + clear_opt2(sb, MB_OPTIMIZE_SCAN);
> > > + else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
> > > + set_opt2(sb, MB_OPTIMIZE_SCAN);
> > > +
> > > err = ext4_mb_init(sb);
> > > if (err) {
> > > ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
> > > diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> > > index 59ca9d73b42f..16b8a838f631 100644
> > > --- a/fs/ext4/sysfs.c
> > > +++ b/fs/ext4/sysfs.c
> > > @@ -213,6 +213,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> > > EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
> > > EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> > > EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
> > > +EXT4_RW_ATTR_SBI_UI(mb_linear_limit, s_mb_linear_limit);
> > > EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
> > > EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
> > > EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
> > > @@ -260,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
> > > ATTR_LIST(mb_stream_req),
> > > ATTR_LIST(mb_group_prealloc),
> > > ATTR_LIST(mb_max_inode_prealloc),
> > > + ATTR_LIST(mb_linear_limit),
> > > ATTR_LIST(max_writeback_mb_bump),
> > > ATTR_LIST(extent_max_zeroout_kb),
> > > ATTR_LIST(trigger_fs_error),
> > > --
> > > 2.31.0.291.g576ba9dcdaf-goog
> > >
> >
> >
> > Cheers, Andreas
> >
> >
> >
> >
> >
On Mar 29, 2021, at 17:03, harshad shirwadkar <[email protected]> wrote:
>
> Another problem that I found in the patch was that when all groups are
> not loaded, the optimization concludes that CR0 and CR1 cannot be used
> for allocation (since these structures aren't populated). The problem
> goes away if we also mount with prefetch_block_bitmaps mount option.
> But I think if all the groups are not loaded, we should at least fall
> back to linear search at CR1. Maybe skipping CR0 is okay.
Alex's patch to skip cr0/1 if no groups are loaded is OK, as long as there is
some group prefetch is triggered in the background, IMHO, and is there
for later allocations.
That avoids lots of scanning at mount time if the filesystem is mostly full,
and the first allocation is not critical as long as this triggers group reads.
Do you have any idea what the performance impact of this is (eg.
latency to first write)? Ideally there would be some group prefetch done
right at mount time (itable zero thread?) so that if writes are not
done *right* after mount there would not be any noticeable latency.
Cheers, Andreas
>> On Mon, Mar 29, 2021 at 12:32 PM harshad shirwadkar
>> <[email protected]> wrote:
>>
>> Okay sounds good. I'll do that!
>>
>>> On Mon, Mar 29, 2021 at 12:14 PM Andreas Dilger <[email protected]> wrote:
>>>
>>> On Mar 24, 2021, at 5:19 PM, Harshad Shirwadkar <[email protected]> wrote:
>>>>
>>>> Instead of traversing through groups linearly, scan groups in specific
>>>> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
>>>> largest free order >= the order of the request. So, with this patch,
>>>> we maintain lists for each possible order and insert each group into a
>>>> list based on the largest free order in its buddy bitmap. During cr 0
>>>> allocation, we traverse these lists in the increasing order of largest
>>>> free orders. This allows us to find a group with the best available cr
>>>> 0 match in constant time. If nothing can be found, we fallback to cr 1
>>>> immediately.
>>>>
>>>> At CR1, the story is slightly different. We want to traverse in the
>>>> order of increasing average fragment size. For CR1, we maintain a rb
>>>> tree of groupinfos which is sorted by average fragment size. Instead
>>>> of traversing linearly, at CR1, we traverse in the order of increasing
>>>> average fragment size, starting at the most optimal group. This brings
>>>> down cr 1 search complexity to log(num groups).
>>>>
>>>> For cr >= 2, we just perform the linear search as before. Also, in
>>>> case of lock contention, we intermittently fallback to linear search
>>>> even in CR 0 and CR 1 cases. This allows us to proceed during the
>>>> allocation path even in case of high contention.
>>>
>>> If you are refreshing this patch anyway, could you rename "mb_linear_limit"
>>> to "mb_linear_groups" or "mb_max_linear_groups" or similar? It otherwise
>>> isn't clear what units the "limit" is in (could be MB/GB or something else).
>>>
>>> Cheers, Andreas
>>>
>>>> There is an opportunity to do optimization at CR2 too. That's because
>>>> at CR2 we only consider groups where bb_free counter (number of free
>>>> blocks) is greater than the request extent size. That's left as future
>>>> work.
>>>>
>>>> All the changes introduced in this patch are protected under a new
>>>> mount option "mb_optimize_scan".
>>>>
>>>> With this patchset, following experiment was performed:
>>>>
>>>> Created a highly fragmented disk of size 65TB. The disk had no
>>>> contiguous 2M regions. Following command was run consecutively for 3
>>>> times:
>>>>
>>>> time dd if=/dev/urandom of=file bs=2M count=10
>>>>
>>>> Here are the results with and without cr 0/1 optimizations introduced
>>>> in this patch:
>>>>
>>>> |---------+------------------------------+---------------------------|
>>>> | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
>>>> |---------+------------------------------+---------------------------|
>>>> | 1st run | 5m1.871s | 2m47.642s |
>>>> | 2nd run | 2m28.390s | 0m0.611s |
>>>> | 3rd run | 2m26.530s | 0m1.255s |
>>>> |---------+------------------------------+---------------------------|
>>>>
>>>> Signed-off-by: Harshad Shirwadkar <[email protected]>
>>>> Reported-by: kernel test robot <[email protected]>
>>>> Reported-by: Dan Carpenter <[email protected]>
>>>> Reviewed-by: Andreas Dilger <[email protected]>
>>>> ---
>>>> fs/ext4/ext4.h | 19 ++-
>>>> fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
>>>> fs/ext4/mballoc.h | 17 ++-
>>>> fs/ext4/super.c | 28 +++-
>>>> fs/ext4/sysfs.c | 2 +
>>>> 5 files changed, 432 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 85eeeba3bca3..5930c8cb5159 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
>>>> #define EXT4_MB_USE_RESERVED 0x2000
>>>> /* Do strict check for free blocks while retrying block allocation */
>>>> #define EXT4_MB_STRICT_CHECK 0x4000
>>>> -
>>>> +/* Large fragment size list lookup succeeded at least once for cr = 0 */
>>>> +#define EXT4_MB_CR0_OPTIMIZED 0x8000
>>>> +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
>>>> +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
>>>> struct ext4_allocation_request {
>>>> /* target inode for block we're allocating */
>>>> struct inode *inode;
>>>> @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
>>>> #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
>>>> #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
>>>> #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
>>>> -
>>>> +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
>>>> + * scanning in mballoc
>>>> + */
>>>>
>>>> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
>>>> ~EXT4_MOUNT_##opt
>>>> @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
>>>> unsigned int s_mb_free_pending;
>>>> struct list_head s_freed_data_list; /* List of blocks to be freed
>>>> after commit completed */
>>>> + struct rb_root s_mb_avg_fragment_size_root;
>>>> + rwlock_t s_mb_rb_lock;
>>>> + struct list_head *s_mb_largest_free_orders;
>>>> + rwlock_t *s_mb_largest_free_orders_locks;
>>>>
>>>> /* tunables */
>>>> unsigned long s_stripe;
>>>> + unsigned int s_mb_linear_limit;
>>>> unsigned int s_mb_stream_request;
>>>> unsigned int s_mb_max_to_scan;
>>>> unsigned int s_mb_min_to_scan;
>>>> @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
>>>> atomic_t s_bal_goals; /* goal hits */
>>>> atomic_t s_bal_breaks; /* too long searches */
>>>> atomic_t s_bal_2orders; /* 2^order hits */
>>>> + atomic_t s_bal_cr0_bad_suggestions;
>>>> + atomic_t s_bal_cr1_bad_suggestions;
>>>> atomic64_t s_bal_cX_groups_considered[4];
>>>> atomic64_t s_bal_cX_hits[4];
>>>> atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
>>>> @@ -3309,11 +3321,14 @@ struct ext4_group_info {
>>>> ext4_grpblk_t bb_free; /* total free blocks */
>>>> ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
>>>> ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
>>>> + ext4_group_t bb_group; /* Group number */
>>>> struct list_head bb_prealloc_list;
>>>> #ifdef DOUBLE_CHECK
>>>> void *bb_bitmap;
>>>> #endif
>>>> struct rw_semaphore alloc_sem;
>>>> + struct rb_node bb_avg_fragment_size_rb;
>>>> + struct list_head bb_largest_free_order_node;
>>>> ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
>>>> * regions, index is order.
>>>> * bb_counters[3] = 5 means
>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>> index 15127d815461..cbf9a89c0ef5 100644
>>>> --- a/fs/ext4/mballoc.c
>>>> +++ b/fs/ext4/mballoc.c
>>>> @@ -127,11 +127,50 @@
>>>> * smallest multiple of the stripe value (sbi->s_stripe) which is
>>>> * greater than the default mb_group_prealloc.
>>>> *
>>>> + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
>>>> + * structures in two data structures:
>>>> + *
>>>> + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
>>>> + *
>>>> + * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
>>>> + *
>>>> + * This is an array of lists where the index in the array represents the
>>>> + * largest free order in the buddy bitmap of the participating group infos of
>>>> + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
>>>> + * number of buddy bitmap orders possible) number of lists. Group-infos are
>>>> + * placed in appropriate lists.
>>>> + *
>>>> + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
>>>> + *
>>>> + * Locking: sbi->s_mb_rb_lock (rwlock)
>>>> + *
>>>> + * This is a red black tree consisting of group infos and the tree is sorted
>>>> + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
>>>> + * / ext4_group_info->bb_fragments).
>>>> + *
>>>> + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
>>>> + * structures to decide the order in which groups are to be traversed for
>>>> + * fulfilling an allocation request.
>>>> + *
>>>> + * At CR = 0, we look for groups which have the largest_free_order >= the order
>>>> + * of the request. We directly look at the largest free order list in the data
>>>> + * structure (1) above where largest_free_order = order of the request. If that
>>>> + * list is empty, we look at remaining list in the increasing order of
>>>> + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
>>>> + *
>>>> + * At CR = 1, we only consider groups where average fragment size > request
>>>> + * size. So, we lookup a group which has average fragment size just above or
>>>> + * equal to request size using our rb tree (data structure 2) in O(log N) time.
>>>> + *
>>>> + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
>>>> + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
>>>> + *
>>>> * The regular allocator (using the buddy cache) supports a few tunables.
>>>> *
>>>> * /sys/fs/ext4/<partition>/mb_min_to_scan
>>>> * /sys/fs/ext4/<partition>/mb_max_to_scan
>>>> * /sys/fs/ext4/<partition>/mb_order2_req
>>>> + * /sys/fs/ext4/<partition>/mb_linear_limit
>>>> *
>>>> * The regular allocator uses buddy scan only if the request len is power of
>>>> * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
>>>> @@ -149,6 +188,16 @@
>>>> * can be used for allocation. ext4_mb_good_group explains how the groups are
>>>> * checked.
>>>> *
>>>> + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
>>>> + * get traversed linearly. That may result in subsequent allocations being not
>>>> + * close to each other. And so, the underlying device may get filled up in a
>>>> + * non-linear fashion. While that may not matter on non-rotational devices, for
>>>> + * rotational devices that may result in higher seek times. "mb_linear_limit"
>>>> + * tells mballoc how many groups mballoc should search linearly before
>>>> + * performing consulting above data structures for more efficient lookups. For
>>>> + * non rotational devices, this value defaults to 0 and for rotational devices
>>>> + * this is set to MB_DEFAULT_LINEAR_LIMIT.
>>>> + *
>>>> * Both the prealloc space are getting populated as above. So for the first
>>>> * request we will hit the buddy cache which will result in this prealloc
>>>> * space getting filled. The prealloc space is then later used for the
>>>> @@ -299,6 +348,8 @@
>>>> * - bitlock on a group (group)
>>>> * - object (inode/locality) (object)
>>>> * - per-pa lock (pa)
>>>> + * - cr0 lists lock (cr0)
>>>> + * - cr1 tree lock (cr1)
>>>> *
>>>> * Paths:
>>>> * - new pa
>>>> @@ -328,6 +379,9 @@
>>>> * group
>>>> * object
>>>> *
>>>> + * - allocation path (ext4_mb_regular_allocator)
>>>> + * group
>>>> + * cr0/cr1
>>>> */
>>>> static struct kmem_cache *ext4_pspace_cachep;
>>>> static struct kmem_cache *ext4_ac_cachep;
>>>> @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
>>>> ext4_group_t group);
>>>> static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
>>>>
>>>> +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
>>>> + ext4_group_t group, int cr);
>>>> +
>>>> /*
>>>> * The algorithm using this percpu seq counter goes below:
>>>> * 1. We sample the percpu discard_pa_seq counter before trying for block
>>>> @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>>>> }
>>>> }
>>>>
>>>> +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
>>>> + int (*cmp)(struct rb_node *, struct rb_node *))
>>>> +{
>>>> + struct rb_node **iter = &root->rb_node, *parent = NULL;
>>>> +
>>>> + while (*iter) {
>>>> + parent = *iter;
>>>> + if (cmp(new, *iter) > 0)
>>>> + iter = &((*iter)->rb_left);
>>>> + else
>>>> + iter = &((*iter)->rb_right);
>>>> + }
>>>> +
>>>> + rb_link_node(new, parent, iter);
>>>> + rb_insert_color(new, root);
>>>> +}
>>>> +
>>>> +static int
>>>> +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
>>>> +{
>>>> + struct ext4_group_info *grp1 = rb_entry(rb1,
>>>> + struct ext4_group_info,
>>>> + bb_avg_fragment_size_rb);
>>>> + struct ext4_group_info *grp2 = rb_entry(rb2,
>>>> + struct ext4_group_info,
>>>> + bb_avg_fragment_size_rb);
>>>> + int num_frags_1, num_frags_2;
>>>> +
>>>> + num_frags_1 = grp1->bb_fragments ?
>>>> + grp1->bb_free / grp1->bb_fragments : 0;
>>>> + num_frags_2 = grp2->bb_fragments ?
>>>> + grp2->bb_free / grp2->bb_fragments : 0;
>>>> +
>>>> + return (num_frags_2 - num_frags_1);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Reinsert grpinfo into the avg_fragment_size tree with new average
>>>> + * fragment size.
>>>> + */
>>>> +static void
>>>> +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
>>>> +{
>>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>> +
>>>> + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
>>>> + return;
>>>> +
>>>> + write_lock(&sbi->s_mb_rb_lock);
>>>> + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
>>>> + rb_erase(&grp->bb_avg_fragment_size_rb,
>>>> + &sbi->s_mb_avg_fragment_size_root);
>>>> + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
>>>> + }
>>>> +
>>>> + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
>>>> + &grp->bb_avg_fragment_size_rb,
>>>> + ext4_mb_avg_fragment_size_cmp);
>>>> + write_unlock(&sbi->s_mb_rb_lock);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Choose next group by traversing largest_free_order lists. Return 0 if next
>>>> + * group was selected optimally. Return 1 if next group was not selected
>>>> + * optimally. Updates *new_cr if cr level needs an update.
>>>> + */
>>>> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>> +{
>>>> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>>>> + struct ext4_group_info *iter, *grp;
>>>> + int i;
>>>> +
>>>> + if (ac->ac_status == AC_STATUS_FOUND)
>>>> + return 1;
>>>> +
>>>> + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
>>>> + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
>>>> +
>>>> + grp = NULL;
>>>> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
>>>> + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
>>>> + continue;
>>>> + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>> + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
>>>> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>> + continue;
>>>> + }
>>>> + grp = NULL;
>>>> + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
>>>> + bb_largest_free_order_node) {
>>>> + if (sbi->s_mb_stats)
>>>> + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
>>>> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
>>>> + grp = iter;
>>>> + break;
>>>> + }
>>>> + }
>>>> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>> + if (grp)
>>>> + break;
>>>> + }
>>>> +
>>>> + if (!grp) {
>>>> + /* Increment cr and search again */
>>>> + *new_cr = 1;
>>>> + } else {
>>>> + *group = grp->bb_group;
>>>> + ac->ac_last_optimal_group = *group;
>>>> + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
>>>> + }
>>>> + return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Choose next group by traversing average fragment size tree. Return 0 if next
>>>> + * group was selected optimally. Return 1 if next group could not selected
>>>> + * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
>>>> + * update.
>>>> + */
>>>> +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>> +{
>>>> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>>>> + int avg_fragment_size, best_so_far;
>>>> + struct rb_node *node, *found;
>>>> + struct ext4_group_info *grp;
>>>> +
>>>> + /*
>>>> + * If there is contention on the lock, instead of waiting for the lock
>>>> + * to become available, just continue searching lineraly. We'll resume
>>>> + * our rb tree search later starting at ac->ac_last_optimal_group.
>>>> + */
>>>> + if (!read_trylock(&sbi->s_mb_rb_lock))
>>>> + return 1;
>>>> +
>>>> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
>>>> + if (sbi->s_mb_stats)
>>>> + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
>>>> + /* We have found something at CR 1 in the past */
>>>> + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
>>>> + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
>>>> + found = rb_next(found)) {
>>>> + grp = rb_entry(found, struct ext4_group_info,
>>>> + bb_avg_fragment_size_rb);
>>>> + if (sbi->s_mb_stats)
>>>> + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
>>>> + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
>>>> + break;
>>>> + }
>>>> +
>>>> + goto done;
>>>> + }
>>>> +
>>>> + node = sbi->s_mb_avg_fragment_size_root.rb_node;
>>>> + best_so_far = 0;
>>>> + found = NULL;
>>>> +
>>>> + while (node) {
>>>> + grp = rb_entry(node, struct ext4_group_info,
>>>> + bb_avg_fragment_size_rb);
>>>> + avg_fragment_size = 0;
>>>> + /*
>>>> + * Perform this check without locking, we'll lock later to confirm.
>>>> + */
>>>> + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
>>>> + avg_fragment_size = grp->bb_fragments ?
>>>> + grp->bb_free / grp->bb_fragments : 0;
>>>> + if (!best_so_far || avg_fragment_size < best_so_far) {
>>>> + best_so_far = avg_fragment_size;
>>>> + found = node;
>>>> + }
>>>> + }
>>>> + if (avg_fragment_size > ac->ac_g_ex.fe_len)
>>>> + node = node->rb_right;
>>>> + else
>>>> + node = node->rb_left;
>>>> + }
>>>> +
>>>> +done:
>>>> + if (found) {
>>>> + grp = rb_entry(found, struct ext4_group_info,
>>>> + bb_avg_fragment_size_rb);
>>>> + *group = grp->bb_group;
>>>> + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
>>>> + } else {
>>>> + *new_cr = 2;
>>>> + }
>>>> +
>>>> + read_unlock(&sbi->s_mb_rb_lock);
>>>> + ac->ac_last_optimal_group = *group;
>>>> + return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * ext4_mb_choose_next_group: choose next group for allocation.
>>>> + *
>>>> + * @ac Allocation Context
>>>> + * @new_cr This is an output parameter. If the there is no good group
>>>> + * available at current CR level, this field is updated to indicate
>>>> + * the new cr level that should be used.
>>>> + * @group This is an input / output parameter. As an input it indicates the
>>>> + * last group used for allocation. As output, this field indicates
>>>> + * the next group that should be used.
>>>> + * @ngroups Total number of groups
>>>> + */
>>>> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + *new_cr = ac->ac_criteria;
>>>> +
>>>> + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
>>>> + *new_cr >= 2 ||
>>>> + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
>>>> + goto inc_and_return;
>>>> +
>>>> + if (ac->ac_groups_linear_remaining) {
>>>> + ac->ac_groups_linear_remaining--;
>>>> + goto inc_and_return;
>>>> + }
>>>> +
>>>> + if (*new_cr == 0) {
>>>> + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
>>>> + if (ret)
>>>> + goto inc_and_return;
>>>> + }
>>>> + if (*new_cr == 1) {
>>>> + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
>>>> + if (ret)
>>>> + goto inc_and_return;
>>>> + }
>>>> + return;
>>>> +
>>>> +inc_and_return:
>>>> + /*
>>>> + * Artificially restricted ngroups for non-extent
>>>> + * files makes group > ngroups possible on first loop.
>>>> + */
>>>> + *group = *group + 1;
>>>> + if (*group >= ngroups)
>>>> + *group = 0;
>>>> +}
>>>> +
>>>> /*
>>>> * Cache the order of the largest free extent we have available in this block
>>>> * group.
>>>> @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>>>> static void
>>>> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>>>> {
>>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>> int i;
>>>> - int bits;
>>>>
>>>> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
>>>> + write_lock(&sbi->s_mb_largest_free_orders_locks[
>>>> + grp->bb_largest_free_order]);
>>>> + list_del_init(&grp->bb_largest_free_order_node);
>>>> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
>>>> + grp->bb_largest_free_order]);
>>>> + }
>>>> grp->bb_largest_free_order = -1; /* uninit */
>>>>
>>>> - bits = MB_NUM_ORDERS(sb) - 1;
>>>> - for (i = bits; i >= 0; i--) {
>>>> + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
>>>> if (grp->bb_counters[i] > 0) {
>>>> grp->bb_largest_free_order = i;
>>>> break;
>>>> }
>>>> }
>>>> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
>>>> + grp->bb_largest_free_order >= 0 && grp->bb_free) {
>>>> + write_lock(&sbi->s_mb_largest_free_orders_locks[
>>>> + grp->bb_largest_free_order]);
>>>> + list_add_tail(&grp->bb_largest_free_order_node,
>>>> + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
>>>> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
>>>> + grp->bb_largest_free_order]);
>>>> + }
>>>> }
>>>>
>>>> static noinline_for_stack
>>>> @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>>>> period = get_cycles() - period;
>>>> atomic_inc(&sbi->s_mb_buddies_generated);
>>>> atomic64_add(period, &sbi->s_mb_generation_time);
>>>> + mb_update_avg_fragment_size(sb, grp);
>>>> }
>>>>
>>>> /* The buddy information is attached the buddy cache inode
>>>> @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
>>>>
>>>> done:
>>>> mb_set_largest_free_order(sb, e4b->bd_info);
>>>> + mb_update_avg_fragment_size(sb, e4b->bd_info);
>>>> mb_check_buddy(e4b);
>>>> }
>>>>
>>>> @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>>>> }
>>>> mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
>>>>
>>>> + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
>>>> ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
>>>> mb_check_buddy(e4b);
>>>>
>>>> @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>> * from the goal value specified
>>>> */
>>>> group = ac->ac_g_ex.fe_group;
>>>> + ac->ac_last_optimal_group = group;
>>>> + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
>>>> prefetch_grp = group;
>>>>
>>>> - for (i = 0; i < ngroups; group++, i++) {
>>>> - int ret = 0;
>>>> + for (i = 0; i < ngroups; i++) {
>>>> + int ret = 0, new_cr;
>>>> +
>>>> cond_resched();
>>>> - /*
>>>> - * Artificially restricted ngroups for non-extent
>>>> - * files makes group > ngroups possible on first loop.
>>>> - */
>>>> - if (group >= ngroups)
>>>> - group = 0;
>>>> +
>>>> + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
>>>> +
>>>> + if (new_cr != cr) {
>>>> + cr = new_cr;
>>>> + goto repeat;
>>>> + }
>>>>
>>>> /*
>>>> * Batch reads of the block allocation bitmaps
>>>> @@ -2578,6 +2902,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
>>>> atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
>>>> seq_printf(seq, "\t\tuseless_loops: %llu\n",
>>>> atomic64_read(&sbi->s_bal_cX_failed[0]));
>>>> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
>>>> + atomic_read(&sbi->s_bal_cr0_bad_suggestions));
>>>>
>>>> seq_puts(seq, "\tcr1_stats:\n");
>>>> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
>>>> @@ -2585,6 +2911,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
>>>> atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
>>>> seq_printf(seq, "\t\tuseless_loops: %llu\n",
>>>> atomic64_read(&sbi->s_bal_cX_failed[1]));
>>>> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
>>>> + atomic_read(&sbi->s_bal_cr1_bad_suggestions));
>>>>
>>>> seq_puts(seq, "\tcr2_stats:\n");
>>>> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
>>>> @@ -2719,7 +3047,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
>>>> INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
>>>> init_rwsem(&meta_group_info[i]->alloc_sem);
>>>> meta_group_info[i]->bb_free_root = RB_ROOT;
>>>> + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
>>>> + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
>>>> meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
>>>> + meta_group_info[i]->bb_group = group;
>>>>
>>>> mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
>>>> return 0;
>>>> @@ -2909,6 +3240,26 @@ int ext4_mb_init(struct super_block *sb)
>>>> i++;
>>>> } while (i < MB_NUM_ORDERS(sb));
>>>>
>>>> + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
>>>> + sbi->s_mb_largest_free_orders =
>>>> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
>>>> + GFP_KERNEL);
>>>> + if (!sbi->s_mb_largest_free_orders) {
>>>> + ret = -ENOMEM;
>>>> + goto out;
>>>> + }
>>>> + sbi->s_mb_largest_free_orders_locks =
>>>> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
>>>> + GFP_KERNEL);
>>>> + if (!sbi->s_mb_largest_free_orders_locks) {
>>>> + ret = -ENOMEM;
>>>> + goto out;
>>>> + }
>>>> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
>>>> + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
>>>> + rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
>>>> + }
>>>> + rwlock_init(&sbi->s_mb_rb_lock);
>>>>
>>>> spin_lock_init(&sbi->s_md_lock);
>>>> sbi->s_mb_free_pending = 0;
>>>> @@ -2961,6 +3312,10 @@ int ext4_mb_init(struct super_block *sb)
>>>> spin_lock_init(&lg->lg_prealloc_lock);
>>>> }
>>>>
>>>> + if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
>>>> + sbi->s_mb_linear_limit = 0;
>>>> + else
>>>> + sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
>>>> /* init file for buddy data */
>>>> ret = ext4_mb_init_backend(sb);
>>>> if (ret != 0)
>>>> @@ -2972,6 +3327,8 @@ int ext4_mb_init(struct super_block *sb)
>>>> free_percpu(sbi->s_locality_groups);
>>>> sbi->s_locality_groups = NULL;
>>>> out:
>>>> + kfree(sbi->s_mb_largest_free_orders);
>>>> + kfree(sbi->s_mb_largest_free_orders_locks);
>>>> kfree(sbi->s_mb_offsets);
>>>> sbi->s_mb_offsets = NULL;
>>>> kfree(sbi->s_mb_maxs);
>>>> @@ -3028,6 +3385,8 @@ int ext4_mb_release(struct super_block *sb)
>>>> kvfree(group_info);
>>>> rcu_read_unlock();
>>>> }
>>>> + kfree(sbi->s_mb_largest_free_orders);
>>>> + kfree(sbi->s_mb_largest_free_orders_locks);
>>>> kfree(sbi->s_mb_offsets);
>>>> kfree(sbi->s_mb_maxs);
>>>> iput(sbi->s_buddy_cache);
>>>> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
>>>> index 68111a10cfee..02585e3cbcad 100644
>>>> --- a/fs/ext4/mballoc.h
>>>> +++ b/fs/ext4/mballoc.h
>>>> @@ -78,6 +78,18 @@
>>>> */
>>>> #define MB_DEFAULT_MAX_INODE_PREALLOC 512
>>>>
>>>> +/*
>>>> + * Number of groups to search linearly before performing group scanning
>>>> + * optimization.
>>>> + */
>>>> +#define MB_DEFAULT_LINEAR_LIMIT 4
>>>> +
>>>> +/*
>>>> + * Minimum number of groups that should be present in the file system to perform
>>>> + * group scanning optimizations.
>>>> + */
>>>> +#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
>>>> +
>>>> /*
>>>> * Number of valid buddy orders
>>>> */
>>>> @@ -166,11 +178,14 @@ struct ext4_allocation_context {
>>>> /* copy of the best found extent taken before preallocation efforts */
>>>> struct ext4_free_extent ac_f_ex;
>>>>
>>>> + ext4_group_t ac_last_optimal_group;
>>>> + __u32 ac_groups_considered;
>>>> + __u32 ac_flags; /* allocation hints */
>>>> __u16 ac_groups_scanned;
>>>> + __u16 ac_groups_linear_remaining;
>>>> __u16 ac_found;
>>>> __u16 ac_tail;
>>>> __u16 ac_buddy;
>>>> - __u16 ac_flags; /* allocation hints */
>>>> __u8 ac_status;
>>>> __u8 ac_criteria;
>>>> __u8 ac_2order; /* if request is to allocate 2^N blocks and
>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>> index 3a2cd9fb7e73..6116640081c0 100644
>>>> --- a/fs/ext4/super.c
>>>> +++ b/fs/ext4/super.c
>>>> @@ -1687,7 +1687,7 @@ enum {
>>>> Opt_dioread_nolock, Opt_dioread_lock,
>>>> Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
>>>> Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
>>>> - Opt_prefetch_block_bitmaps,
>>>> + Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
>>>> #ifdef CONFIG_EXT4_DEBUG
>>>> Opt_fc_debug_max_replay, Opt_fc_debug_force
>>>> #endif
>>>> @@ -1788,6 +1788,7 @@ static const match_table_t tokens = {
>>>> {Opt_nombcache, "nombcache"},
>>>> {Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
>>>> {Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
>>>> + {Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
>>>> {Opt_removed, "check=none"}, /* mount option from ext2/3 */
>>>> {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
>>>> {Opt_removed, "reservation"}, /* mount option from ext2/3 */
>>>> @@ -1820,6 +1821,8 @@ static ext4_fsblk_t get_sb_block(void **data)
>>>> }
>>>>
>>>> #define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
>>>> +#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
>>>> +
>>>> static const char deprecated_msg[] =
>>>> "Mount option \"%s\" will be removed by %s\n"
>>>> "Contact [email protected] if you think we should keep it.\n";
>>>> @@ -2008,6 +2011,7 @@ static const struct mount_opts {
>>>> {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
>>>> {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
>>>> MOPT_SET},
>>>> + {Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
>>>> #ifdef CONFIG_EXT4_DEBUG
>>>> {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
>>>> MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
>>>> @@ -2092,6 +2096,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
>>>> struct ext4_parsed_options {
>>>> unsigned long journal_devnum;
>>>> unsigned int journal_ioprio;
>>>> + int mb_optimize_scan;
>>>> };
>>>>
>>>> static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>>>> @@ -2388,6 +2393,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>>>> sbi->s_mount_opt |= m->mount_opt;
>>>> } else if (token == Opt_data_err_ignore) {
>>>> sbi->s_mount_opt &= ~m->mount_opt;
>>>> + } else if (token == Opt_mb_optimize_scan) {
>>>> + if (arg != 0 && arg != 1) {
>>>> + ext4_msg(sb, KERN_WARNING,
>>>> + "mb_optimize_scan should be set to 0 or 1.");
>>>> + return -1;
>>>> + }
>>>> + parsed_opts->mb_optimize_scan = arg;
>>>> } else {
>>>> if (!args->from)
>>>> arg = 1;
>>>> @@ -4034,6 +4046,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>> /* Set defaults for the variables that will be set during parsing */
>>>> parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
>>>> parsed_opts.journal_devnum = 0;
>>>> + parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
>>>>
>>>> if ((data && !orig_data) || !sbi)
>>>> goto out_free_base;
>>>> @@ -4984,6 +4997,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>> ext4_fc_replay_cleanup(sb);
>>>>
>>>> ext4_ext_init(sb);
>>>> +
>>>> + /*
>>>> + * Enable optimize_scan if number of groups is > threshold. This can be
>>>> + * turned off by passing "mb_optimize_scan=0". This can also be
>>>> + * turned on forcefully by passing "mb_optimize_scan=1".
>>>> + */
>>>> + if (parsed_opts.mb_optimize_scan == 1)
>>>> + set_opt2(sb, MB_OPTIMIZE_SCAN);
>>>> + else if (parsed_opts.mb_optimize_scan == 0)
>>>> + clear_opt2(sb, MB_OPTIMIZE_SCAN);
>>>> + else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
>>>> + set_opt2(sb, MB_OPTIMIZE_SCAN);
>>>> +
>>>> err = ext4_mb_init(sb);
>>>> if (err) {
>>>> ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
>>>> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
>>>> index 59ca9d73b42f..16b8a838f631 100644
>>>> --- a/fs/ext4/sysfs.c
>>>> +++ b/fs/ext4/sysfs.c
>>>> @@ -213,6 +213,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
>>>> EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
>>>> EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
>>>> EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
>>>> +EXT4_RW_ATTR_SBI_UI(mb_linear_limit, s_mb_linear_limit);
>>>> EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
>>>> EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
>>>> EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
>>>> @@ -260,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
>>>> ATTR_LIST(mb_stream_req),
>>>> ATTR_LIST(mb_group_prealloc),
>>>> ATTR_LIST(mb_max_inode_prealloc),
>>>> + ATTR_LIST(mb_linear_limit),
>>>> ATTR_LIST(max_writeback_mb_bump),
>>>> ATTR_LIST(extent_max_zeroout_kb),
>>>> ATTR_LIST(trigger_fs_error),
>>>> --
>>>> 2.31.0.291.g576ba9dcdaf-goog
>>>>
>>>
>>>
>>> Cheers, Andreas
>>>
>>>
>>>
>>>
>>>
I ran some experiments this morning and here are the results:
1) Config 1:
- add change to skip CR0 / CR1 optimizations and fall back to
linear search when all groups are not loaded
- prefetch block bitmaps mount option turned off
- first_write_latency: 6m 56s
- number_of_groups_loaded_at_the_end_of_first_write: 30726 out of
491520 total
2) Config 2:
- add change to skip CR0 / CR1 optimizations and fall back to
linear search when all groups are not loaded
- prefetch block bitmaps mount option turned on
- first_write_latency:4m42s
- number_of_groups_loaded_at_the_end_of_first_write: 490497 out
of 491520 total
3) Config 3:
- don't fallback to linear search
- prefetch block bitmaps turned off
- first_write_latency: 0.13s
- number_of_groups_loaded_at_the_end_of_first_write: 7 out of 491520 total
4) Config 4:
- don't fallback to linear search
- prefetch block bitmaps turned on
- first_write_latency: 0.15s
- number_of_groups_loaded_at_the_end_of_first_write: 48384 out of
491520 total
Based on the above numbers, I think we should go for config 4, which
means we should have prefetch_block_bitmaps turned on when
mb_optimize_scan is on. What do you think about turning
prefetch_block_bitmaps mount option on by default?
- Harshad
On Mon, Mar 29, 2021 at 9:10 PM Andreas Dilger <[email protected]> wrote:
>
> On Mar 29, 2021, at 17:03, harshad shirwadkar <[email protected]> wrote:
> >
> > Another problem that I found in the patch was that when all groups are
> > not loaded, the optimization concludes that CR0 and CR1 cannot be used
> > for allocation (since these structures aren't populated). The problem
> > goes away if we also mount with prefetch_block_bitmaps mount option.
> > But I think if all the groups are not loaded, we should at least fall
> > back to linear search at CR1. Maybe skipping CR0 is okay.
>
> Alex's patch to skip cr0/1 if no groups are loaded is OK, as long as there is
> some group prefetch is triggered in the background, IMHO, and is there
> for later allocations.
>
> That avoids lots of scanning at mount time if the filesystem is mostly full,
> and the first allocation is not critical as long as this triggers group reads.
>
> Do you have any idea what the performance impact of this is (eg.
> latency to first write)? Ideally there would be some group prefetch done
> right at mount time (itable zero thread?) so that if writes are not
> done *right* after mount there would not be any noticeable latency.
>
> Cheers, Andreas
>
> >> On Mon, Mar 29, 2021 at 12:32 PM harshad shirwadkar
> >> <[email protected]> wrote:
> >>
> >> Okay sounds good. I'll do that!
> >>
> >>> On Mon, Mar 29, 2021 at 12:14 PM Andreas Dilger <[email protected]> wrote:
> >>>
> >>> On Mar 24, 2021, at 5:19 PM, Harshad Shirwadkar <[email protected]> wrote:
> >>>>
> >>>> Instead of traversing through groups linearly, scan groups in specific
> >>>> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> >>>> largest free order >= the order of the request. So, with this patch,
> >>>> we maintain lists for each possible order and insert each group into a
> >>>> list based on the largest free order in its buddy bitmap. During cr 0
> >>>> allocation, we traverse these lists in the increasing order of largest
> >>>> free orders. This allows us to find a group with the best available cr
> >>>> 0 match in constant time. If nothing can be found, we fallback to cr 1
> >>>> immediately.
> >>>>
> >>>> At CR1, the story is slightly different. We want to traverse in the
> >>>> order of increasing average fragment size. For CR1, we maintain a rb
> >>>> tree of groupinfos which is sorted by average fragment size. Instead
> >>>> of traversing linearly, at CR1, we traverse in the order of increasing
> >>>> average fragment size, starting at the most optimal group. This brings
> >>>> down cr 1 search complexity to log(num groups).
> >>>>
> >>>> For cr >= 2, we just perform the linear search as before. Also, in
> >>>> case of lock contention, we intermittently fallback to linear search
> >>>> even in CR 0 and CR 1 cases. This allows us to proceed during the
> >>>> allocation path even in case of high contention.
> >>>
> >>> If you are refreshing this patch anyway, could you rename "mb_linear_limit"
> >>> to "mb_linear_groups" or "mb_max_linear_groups" or similar? It otherwise
> >>> isn't clear what units the "limit" is in (could be MB/GB or something else).
> >>>
> >>> Cheers, Andreas
> >>>
> >>>> There is an opportunity to do optimization at CR2 too. That's because
> >>>> at CR2 we only consider groups where bb_free counter (number of free
> >>>> blocks) is greater than the request extent size. That's left as future
> >>>> work.
> >>>>
> >>>> All the changes introduced in this patch are protected under a new
> >>>> mount option "mb_optimize_scan".
> >>>>
> >>>> With this patchset, following experiment was performed:
> >>>>
> >>>> Created a highly fragmented disk of size 65TB. The disk had no
> >>>> contiguous 2M regions. Following command was run consecutively for 3
> >>>> times:
> >>>>
> >>>> time dd if=/dev/urandom of=file bs=2M count=10
> >>>>
> >>>> Here are the results with and without cr 0/1 optimizations introduced
> >>>> in this patch:
> >>>>
> >>>> |---------+------------------------------+---------------------------|
> >>>> | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> >>>> |---------+------------------------------+---------------------------|
> >>>> | 1st run | 5m1.871s | 2m47.642s |
> >>>> | 2nd run | 2m28.390s | 0m0.611s |
> >>>> | 3rd run | 2m26.530s | 0m1.255s |
> >>>> |---------+------------------------------+---------------------------|
> >>>>
> >>>> Signed-off-by: Harshad Shirwadkar <[email protected]>
> >>>> Reported-by: kernel test robot <[email protected]>
> >>>> Reported-by: Dan Carpenter <[email protected]>
> >>>> Reviewed-by: Andreas Dilger <[email protected]>
> >>>> ---
> >>>> fs/ext4/ext4.h | 19 ++-
> >>>> fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
> >>>> fs/ext4/mballoc.h | 17 ++-
> >>>> fs/ext4/super.c | 28 +++-
> >>>> fs/ext4/sysfs.c | 2 +
> >>>> 5 files changed, 432 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>>> index 85eeeba3bca3..5930c8cb5159 100644
> >>>> --- a/fs/ext4/ext4.h
> >>>> +++ b/fs/ext4/ext4.h
> >>>> @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
> >>>> #define EXT4_MB_USE_RESERVED 0x2000
> >>>> /* Do strict check for free blocks while retrying block allocation */
> >>>> #define EXT4_MB_STRICT_CHECK 0x4000
> >>>> -
> >>>> +/* Large fragment size list lookup succeeded at least once for cr = 0 */
> >>>> +#define EXT4_MB_CR0_OPTIMIZED 0x8000
> >>>> +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
> >>>> +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
> >>>> struct ext4_allocation_request {
> >>>> /* target inode for block we're allocating */
> >>>> struct inode *inode;
> >>>> @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
> >>>> #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
> >>>> #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
> >>>> #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
> >>>> -
> >>>> +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
> >>>> + * scanning in mballoc
> >>>> + */
> >>>>
> >>>> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> >>>> ~EXT4_MOUNT_##opt
> >>>> @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
> >>>> unsigned int s_mb_free_pending;
> >>>> struct list_head s_freed_data_list; /* List of blocks to be freed
> >>>> after commit completed */
> >>>> + struct rb_root s_mb_avg_fragment_size_root;
> >>>> + rwlock_t s_mb_rb_lock;
> >>>> + struct list_head *s_mb_largest_free_orders;
> >>>> + rwlock_t *s_mb_largest_free_orders_locks;
> >>>>
> >>>> /* tunables */
> >>>> unsigned long s_stripe;
> >>>> + unsigned int s_mb_linear_limit;
> >>>> unsigned int s_mb_stream_request;
> >>>> unsigned int s_mb_max_to_scan;
> >>>> unsigned int s_mb_min_to_scan;
> >>>> @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
> >>>> atomic_t s_bal_goals; /* goal hits */
> >>>> atomic_t s_bal_breaks; /* too long searches */
> >>>> atomic_t s_bal_2orders; /* 2^order hits */
> >>>> + atomic_t s_bal_cr0_bad_suggestions;
> >>>> + atomic_t s_bal_cr1_bad_suggestions;
> >>>> atomic64_t s_bal_cX_groups_considered[4];
> >>>> atomic64_t s_bal_cX_hits[4];
> >>>> atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
> >>>> @@ -3309,11 +3321,14 @@ struct ext4_group_info {
> >>>> ext4_grpblk_t bb_free; /* total free blocks */
> >>>> ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> >>>> ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
> >>>> + ext4_group_t bb_group; /* Group number */
> >>>> struct list_head bb_prealloc_list;
> >>>> #ifdef DOUBLE_CHECK
> >>>> void *bb_bitmap;
> >>>> #endif
> >>>> struct rw_semaphore alloc_sem;
> >>>> + struct rb_node bb_avg_fragment_size_rb;
> >>>> + struct list_head bb_largest_free_order_node;
> >>>> ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
> >>>> * regions, index is order.
> >>>> * bb_counters[3] = 5 means
> >>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> >>>> index 15127d815461..cbf9a89c0ef5 100644
> >>>> --- a/fs/ext4/mballoc.c
> >>>> +++ b/fs/ext4/mballoc.c
> >>>> @@ -127,11 +127,50 @@
> >>>> * smallest multiple of the stripe value (sbi->s_stripe) which is
> >>>> * greater than the default mb_group_prealloc.
> >>>> *
> >>>> + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
> >>>> + * structures in two data structures:
> >>>> + *
> >>>> + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
> >>>> + *
> >>>> + * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
> >>>> + *
> >>>> + * This is an array of lists where the index in the array represents the
> >>>> + * largest free order in the buddy bitmap of the participating group infos of
> >>>> + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
> >>>> + * number of buddy bitmap orders possible) number of lists. Group-infos are
> >>>> + * placed in appropriate lists.
> >>>> + *
> >>>> + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
> >>>> + *
> >>>> + * Locking: sbi->s_mb_rb_lock (rwlock)
> >>>> + *
> >>>> + * This is a red black tree consisting of group infos and the tree is sorted
> >>>> + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
> >>>> + * / ext4_group_info->bb_fragments).
> >>>> + *
> >>>> + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
> >>>> + * structures to decide the order in which groups are to be traversed for
> >>>> + * fulfilling an allocation request.
> >>>> + *
> >>>> + * At CR = 0, we look for groups which have the largest_free_order >= the order
> >>>> + * of the request. We directly look at the largest free order list in the data
> >>>> + * structure (1) above where largest_free_order = order of the request. If that
> >>>> + * list is empty, we look at remaining list in the increasing order of
> >>>> + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
> >>>> + *
> >>>> + * At CR = 1, we only consider groups where average fragment size > request
> >>>> + * size. So, we lookup a group which has average fragment size just above or
> >>>> + * equal to request size using our rb tree (data structure 2) in O(log N) time.
> >>>> + *
> >>>> + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
> >>>> + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
> >>>> + *
> >>>> * The regular allocator (using the buddy cache) supports a few tunables.
> >>>> *
> >>>> * /sys/fs/ext4/<partition>/mb_min_to_scan
> >>>> * /sys/fs/ext4/<partition>/mb_max_to_scan
> >>>> * /sys/fs/ext4/<partition>/mb_order2_req
> >>>> + * /sys/fs/ext4/<partition>/mb_linear_limit
> >>>> *
> >>>> * The regular allocator uses buddy scan only if the request len is power of
> >>>> * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
> >>>> @@ -149,6 +188,16 @@
> >>>> * can be used for allocation. ext4_mb_good_group explains how the groups are
> >>>> * checked.
> >>>> *
> >>>> + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
> >>>> + * get traversed linearly. That may result in subsequent allocations being not
> >>>> + * close to each other. And so, the underlying device may get filled up in a
> >>>> + * non-linear fashion. While that may not matter on non-rotational devices, for
> >>>> + * rotational devices that may result in higher seek times. "mb_linear_limit"
> >>>> + * tells mballoc how many groups mballoc should search linearly before
> >>>> + * performing consulting above data structures for more efficient lookups. For
> >>>> + * non rotational devices, this value defaults to 0 and for rotational devices
> >>>> + * this is set to MB_DEFAULT_LINEAR_LIMIT.
> >>>> + *
> >>>> * Both the prealloc space are getting populated as above. So for the first
> >>>> * request we will hit the buddy cache which will result in this prealloc
> >>>> * space getting filled. The prealloc space is then later used for the
> >>>> @@ -299,6 +348,8 @@
> >>>> * - bitlock on a group (group)
> >>>> * - object (inode/locality) (object)
> >>>> * - per-pa lock (pa)
> >>>> + * - cr0 lists lock (cr0)
> >>>> + * - cr1 tree lock (cr1)
> >>>> *
> >>>> * Paths:
> >>>> * - new pa
> >>>> @@ -328,6 +379,9 @@
> >>>> * group
> >>>> * object
> >>>> *
> >>>> + * - allocation path (ext4_mb_regular_allocator)
> >>>> + * group
> >>>> + * cr0/cr1
> >>>> */
> >>>> static struct kmem_cache *ext4_pspace_cachep;
> >>>> static struct kmem_cache *ext4_ac_cachep;
> >>>> @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
> >>>> ext4_group_t group);
> >>>> static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
> >>>>
> >>>> +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
> >>>> + ext4_group_t group, int cr);
> >>>> +
> >>>> /*
> >>>> * The algorithm using this percpu seq counter goes below:
> >>>> * 1. We sample the percpu discard_pa_seq counter before trying for block
> >>>> @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> >>>> }
> >>>> }
> >>>>
> >>>> +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> >>>> + int (*cmp)(struct rb_node *, struct rb_node *))
> >>>> +{
> >>>> + struct rb_node **iter = &root->rb_node, *parent = NULL;
> >>>> +
> >>>> + while (*iter) {
> >>>> + parent = *iter;
> >>>> + if (cmp(new, *iter) > 0)
> >>>> + iter = &((*iter)->rb_left);
> >>>> + else
> >>>> + iter = &((*iter)->rb_right);
> >>>> + }
> >>>> +
> >>>> + rb_link_node(new, parent, iter);
> >>>> + rb_insert_color(new, root);
> >>>> +}
> >>>> +
> >>>> +static int
> >>>> +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> >>>> +{
> >>>> + struct ext4_group_info *grp1 = rb_entry(rb1,
> >>>> + struct ext4_group_info,
> >>>> + bb_avg_fragment_size_rb);
> >>>> + struct ext4_group_info *grp2 = rb_entry(rb2,
> >>>> + struct ext4_group_info,
> >>>> + bb_avg_fragment_size_rb);
> >>>> + int num_frags_1, num_frags_2;
> >>>> +
> >>>> + num_frags_1 = grp1->bb_fragments ?
> >>>> + grp1->bb_free / grp1->bb_fragments : 0;
> >>>> + num_frags_2 = grp2->bb_fragments ?
> >>>> + grp2->bb_free / grp2->bb_fragments : 0;
> >>>> +
> >>>> + return (num_frags_2 - num_frags_1);
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * Reinsert grpinfo into the avg_fragment_size tree with new average
> >>>> + * fragment size.
> >>>> + */
> >>>> +static void
> >>>> +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> >>>> +{
> >>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> >>>> +
> >>>> + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> >>>> + return;
> >>>> +
> >>>> + write_lock(&sbi->s_mb_rb_lock);
> >>>> + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> >>>> + rb_erase(&grp->bb_avg_fragment_size_rb,
> >>>> + &sbi->s_mb_avg_fragment_size_root);
> >>>> + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> >>>> + }
> >>>> +
> >>>> + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
> >>>> + &grp->bb_avg_fragment_size_rb,
> >>>> + ext4_mb_avg_fragment_size_cmp);
> >>>> + write_unlock(&sbi->s_mb_rb_lock);
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * Choose next group by traversing largest_free_order lists. Return 0 if next
> >>>> + * group was selected optimally. Return 1 if next group was not selected
> >>>> + * optimally. Updates *new_cr if cr level needs an update.
> >>>> + */
> >>>> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> >>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> >>>> +{
> >>>> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> >>>> + struct ext4_group_info *iter, *grp;
> >>>> + int i;
> >>>> +
> >>>> + if (ac->ac_status == AC_STATUS_FOUND)
> >>>> + return 1;
> >>>> +
> >>>> + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> >>>> + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> >>>> +
> >>>> + grp = NULL;
> >>>> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> >>>> + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> >>>> + continue;
> >>>> + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> >>>> + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> >>>> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> >>>> + continue;
> >>>> + }
> >>>> + grp = NULL;
> >>>> + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> >>>> + bb_largest_free_order_node) {
> >>>> + if (sbi->s_mb_stats)
> >>>> + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> >>>> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> >>>> + grp = iter;
> >>>> + break;
> >>>> + }
> >>>> + }
> >>>> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> >>>> + if (grp)
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + if (!grp) {
> >>>> + /* Increment cr and search again */
> >>>> + *new_cr = 1;
> >>>> + } else {
> >>>> + *group = grp->bb_group;
> >>>> + ac->ac_last_optimal_group = *group;
> >>>> + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
> >>>> + }
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * Choose next group by traversing average fragment size tree. Return 0 if next
> >>>> + * group was selected optimally. Return 1 if next group could not selected
> >>>> + * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
> >>>> + * update.
> >>>> + */
> >>>> +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> >>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> >>>> +{
> >>>> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> >>>> + int avg_fragment_size, best_so_far;
> >>>> + struct rb_node *node, *found;
> >>>> + struct ext4_group_info *grp;
> >>>> +
> >>>> + /*
> >>>> + * If there is contention on the lock, instead of waiting for the lock
> >>>> + * to become available, just continue searching lineraly. We'll resume
> >>>> + * our rb tree search later starting at ac->ac_last_optimal_group.
> >>>> + */
> >>>> + if (!read_trylock(&sbi->s_mb_rb_lock))
> >>>> + return 1;
> >>>> +
> >>>> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> >>>> + if (sbi->s_mb_stats)
> >>>> + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> >>>> + /* We have found something at CR 1 in the past */
> >>>> + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
> >>>> + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
> >>>> + found = rb_next(found)) {
> >>>> + grp = rb_entry(found, struct ext4_group_info,
> >>>> + bb_avg_fragment_size_rb);
> >>>> + if (sbi->s_mb_stats)
> >>>> + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> >>>> + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + goto done;
> >>>> + }
> >>>> +
> >>>> + node = sbi->s_mb_avg_fragment_size_root.rb_node;
> >>>> + best_so_far = 0;
> >>>> + found = NULL;
> >>>> +
> >>>> + while (node) {
> >>>> + grp = rb_entry(node, struct ext4_group_info,
> >>>> + bb_avg_fragment_size_rb);
> >>>> + avg_fragment_size = 0;
> >>>> + /*
> >>>> + * Perform this check without locking, we'll lock later to confirm.
> >>>> + */
> >>>> + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
> >>>> + avg_fragment_size = grp->bb_fragments ?
> >>>> + grp->bb_free / grp->bb_fragments : 0;
> >>>> + if (!best_so_far || avg_fragment_size < best_so_far) {
> >>>> + best_so_far = avg_fragment_size;
> >>>> + found = node;
> >>>> + }
> >>>> + }
> >>>> + if (avg_fragment_size > ac->ac_g_ex.fe_len)
> >>>> + node = node->rb_right;
> >>>> + else
> >>>> + node = node->rb_left;
> >>>> + }
> >>>> +
> >>>> +done:
> >>>> + if (found) {
> >>>> + grp = rb_entry(found, struct ext4_group_info,
> >>>> + bb_avg_fragment_size_rb);
> >>>> + *group = grp->bb_group;
> >>>> + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> >>>> + } else {
> >>>> + *new_cr = 2;
> >>>> + }
> >>>> +
> >>>> + read_unlock(&sbi->s_mb_rb_lock);
> >>>> + ac->ac_last_optimal_group = *group;
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * ext4_mb_choose_next_group: choose next group for allocation.
> >>>> + *
> >>>> + * @ac Allocation Context
> >>>> + * @new_cr This is an output parameter. If the there is no good group
> >>>> + * available at current CR level, this field is updated to indicate
> >>>> + * the new cr level that should be used.
> >>>> + * @group This is an input / output parameter. As an input it indicates the
> >>>> + * last group used for allocation. As output, this field indicates
> >>>> + * the next group that should be used.
> >>>> + * @ngroups Total number of groups
> >>>> + */
> >>>> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> >>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> >>>> +{
> >>>> + int ret;
> >>>> +
> >>>> + *new_cr = ac->ac_criteria;
> >>>> +
> >>>> + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> >>>> + *new_cr >= 2 ||
> >>>> + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> >>>> + goto inc_and_return;
> >>>> +
> >>>> + if (ac->ac_groups_linear_remaining) {
> >>>> + ac->ac_groups_linear_remaining--;
> >>>> + goto inc_and_return;
> >>>> + }
> >>>> +
> >>>> + if (*new_cr == 0) {
> >>>> + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> >>>> + if (ret)
> >>>> + goto inc_and_return;
> >>>> + }
> >>>> + if (*new_cr == 1) {
> >>>> + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> >>>> + if (ret)
> >>>> + goto inc_and_return;
> >>>> + }
> >>>> + return;
> >>>> +
> >>>> +inc_and_return:
> >>>> + /*
> >>>> + * Artificially restricted ngroups for non-extent
> >>>> + * files makes group > ngroups possible on first loop.
> >>>> + */
> >>>> + *group = *group + 1;
> >>>> + if (*group >= ngroups)
> >>>> + *group = 0;
> >>>> +}
> >>>> +
> >>>> /*
> >>>> * Cache the order of the largest free extent we have available in this block
> >>>> * group.
> >>>> @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> >>>> static void
> >>>> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> >>>> {
> >>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> >>>> int i;
> >>>> - int bits;
> >>>>
> >>>> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> >>>> + write_lock(&sbi->s_mb_largest_free_orders_locks[
> >>>> + grp->bb_largest_free_order]);
> >>>> + list_del_init(&grp->bb_largest_free_order_node);
> >>>> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> >>>> + grp->bb_largest_free_order]);
> >>>> + }
> >>>> grp->bb_largest_free_order = -1; /* uninit */
> >>>>
> >>>> - bits = MB_NUM_ORDERS(sb) - 1;
> >>>> - for (i = bits; i >= 0; i--) {
> >>>> + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> >>>> if (grp->bb_counters[i] > 0) {
> >>>> grp->bb_largest_free_order = i;
> >>>> break;
> >>>> }
> >>>> }
> >>>> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
> >>>> + grp->bb_largest_free_order >= 0 && grp->bb_free) {
> >>>> + write_lock(&sbi->s_mb_largest_free_orders_locks[
> >>>> + grp->bb_largest_free_order]);
> >>>> + list_add_tail(&grp->bb_largest_free_order_node,
> >>>> + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> >>>> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> >>>> + grp->bb_largest_free_order]);
> >>>> + }
> >>>> }
> >>>>
> >>>> static noinline_for_stack
> >>>> @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
> >>>> period = get_cycles() - period;
> >>>> atomic_inc(&sbi->s_mb_buddies_generated);
> >>>> atomic64_add(period, &sbi->s_mb_generation_time);
> >>>> + mb_update_avg_fragment_size(sb, grp);
> >>>> }
> >>>>
> >>>> /* The buddy information is attached the buddy cache inode
> >>>> @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> >>>>
> >>>> done:
> >>>> mb_set_largest_free_order(sb, e4b->bd_info);
> >>>> + mb_update_avg_fragment_size(sb, e4b->bd_info);
> >>>> mb_check_buddy(e4b);
> >>>> }
> >>>>
> >>>> @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> >>>> }
> >>>> mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
> >>>>
> >>>> + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
> >>>> ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
> >>>> mb_check_buddy(e4b);
> >>>>
> >>>> @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> >>>> * from the goal value specified
> >>>> */
> >>>> group = ac->ac_g_ex.fe_group;
> >>>> + ac->ac_last_optimal_group = group;
> >>>> + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
> >>>> prefetch_grp = group;
> >>>>
> >>>> - for (i = 0; i < ngroups; group++, i++) {
> >>>> - int ret = 0;
> >>>> + for (i = 0; i < ngroups; i++) {
> >>>> + int ret = 0, new_cr;
> >>>> +
> >>>> cond_resched();
> >>>> - /*
> >>>> - * Artificially restricted ngroups for non-extent
> >>>> - * files makes group > ngroups possible on first loop.
> >>>> - */
> >>>> - if (group >= ngroups)
> >>>> - group = 0;
> >>>> +
> >>>> + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
> >>>> +
> >>>> + if (new_cr != cr) {
> >>>> + cr = new_cr;
> >>>> + goto repeat;
> >>>> + }
> >>>>
> >>>> /*
> >>>> * Batch reads of the block allocation bitmaps
> >>>> @@ -2578,6 +2902,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> >>>> atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
> >>>> seq_printf(seq, "\t\tuseless_loops: %llu\n",
> >>>> atomic64_read(&sbi->s_bal_cX_failed[0]));
> >>>> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
> >>>> + atomic_read(&sbi->s_bal_cr0_bad_suggestions));
> >>>>
> >>>> seq_puts(seq, "\tcr1_stats:\n");
> >>>> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
> >>>> @@ -2585,6 +2911,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
> >>>> atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
> >>>> seq_printf(seq, "\t\tuseless_loops: %llu\n",
> >>>> atomic64_read(&sbi->s_bal_cX_failed[1]));
> >>>> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
> >>>> + atomic_read(&sbi->s_bal_cr1_bad_suggestions));
> >>>>
> >>>> seq_puts(seq, "\tcr2_stats:\n");
> >>>> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
> >>>> @@ -2719,7 +3047,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> >>>> INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> >>>> init_rwsem(&meta_group_info[i]->alloc_sem);
> >>>> meta_group_info[i]->bb_free_root = RB_ROOT;
> >>>> + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> >>>> + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
> >>>> meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
> >>>> + meta_group_info[i]->bb_group = group;
> >>>>
> >>>> mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
> >>>> return 0;
> >>>> @@ -2909,6 +3240,26 @@ int ext4_mb_init(struct super_block *sb)
> >>>> i++;
> >>>> } while (i < MB_NUM_ORDERS(sb));
> >>>>
> >>>> + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
> >>>> + sbi->s_mb_largest_free_orders =
> >>>> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> >>>> + GFP_KERNEL);
> >>>> + if (!sbi->s_mb_largest_free_orders) {
> >>>> + ret = -ENOMEM;
> >>>> + goto out;
> >>>> + }
> >>>> + sbi->s_mb_largest_free_orders_locks =
> >>>> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> >>>> + GFP_KERNEL);
> >>>> + if (!sbi->s_mb_largest_free_orders_locks) {
> >>>> + ret = -ENOMEM;
> >>>> + goto out;
> >>>> + }
> >>>> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> >>>> + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> >>>> + rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
> >>>> + }
> >>>> + rwlock_init(&sbi->s_mb_rb_lock);
> >>>>
> >>>> spin_lock_init(&sbi->s_md_lock);
> >>>> sbi->s_mb_free_pending = 0;
> >>>> @@ -2961,6 +3312,10 @@ int ext4_mb_init(struct super_block *sb)
> >>>> spin_lock_init(&lg->lg_prealloc_lock);
> >>>> }
> >>>>
> >>>> + if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
> >>>> + sbi->s_mb_linear_limit = 0;
> >>>> + else
> >>>> + sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
> >>>> /* init file for buddy data */
> >>>> ret = ext4_mb_init_backend(sb);
> >>>> if (ret != 0)
> >>>> @@ -2972,6 +3327,8 @@ int ext4_mb_init(struct super_block *sb)
> >>>> free_percpu(sbi->s_locality_groups);
> >>>> sbi->s_locality_groups = NULL;
> >>>> out:
> >>>> + kfree(sbi->s_mb_largest_free_orders);
> >>>> + kfree(sbi->s_mb_largest_free_orders_locks);
> >>>> kfree(sbi->s_mb_offsets);
> >>>> sbi->s_mb_offsets = NULL;
> >>>> kfree(sbi->s_mb_maxs);
> >>>> @@ -3028,6 +3385,8 @@ int ext4_mb_release(struct super_block *sb)
> >>>> kvfree(group_info);
> >>>> rcu_read_unlock();
> >>>> }
> >>>> + kfree(sbi->s_mb_largest_free_orders);
> >>>> + kfree(sbi->s_mb_largest_free_orders_locks);
> >>>> kfree(sbi->s_mb_offsets);
> >>>> kfree(sbi->s_mb_maxs);
> >>>> iput(sbi->s_buddy_cache);
> >>>> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> >>>> index 68111a10cfee..02585e3cbcad 100644
> >>>> --- a/fs/ext4/mballoc.h
> >>>> +++ b/fs/ext4/mballoc.h
> >>>> @@ -78,6 +78,18 @@
> >>>> */
> >>>> #define MB_DEFAULT_MAX_INODE_PREALLOC 512
> >>>>
> >>>> +/*
> >>>> + * Number of groups to search linearly before performing group scanning
> >>>> + * optimization.
> >>>> + */
> >>>> +#define MB_DEFAULT_LINEAR_LIMIT 4
> >>>> +
> >>>> +/*
> >>>> + * Minimum number of groups that should be present in the file system to perform
> >>>> + * group scanning optimizations.
> >>>> + */
> >>>> +#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
> >>>> +
> >>>> /*
> >>>> * Number of valid buddy orders
> >>>> */
> >>>> @@ -166,11 +178,14 @@ struct ext4_allocation_context {
> >>>> /* copy of the best found extent taken before preallocation efforts */
> >>>> struct ext4_free_extent ac_f_ex;
> >>>>
> >>>> + ext4_group_t ac_last_optimal_group;
> >>>> + __u32 ac_groups_considered;
> >>>> + __u32 ac_flags; /* allocation hints */
> >>>> __u16 ac_groups_scanned;
> >>>> + __u16 ac_groups_linear_remaining;
> >>>> __u16 ac_found;
> >>>> __u16 ac_tail;
> >>>> __u16 ac_buddy;
> >>>> - __u16 ac_flags; /* allocation hints */
> >>>> __u8 ac_status;
> >>>> __u8 ac_criteria;
> >>>> __u8 ac_2order; /* if request is to allocate 2^N blocks and
> >>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >>>> index 3a2cd9fb7e73..6116640081c0 100644
> >>>> --- a/fs/ext4/super.c
> >>>> +++ b/fs/ext4/super.c
> >>>> @@ -1687,7 +1687,7 @@ enum {
> >>>> Opt_dioread_nolock, Opt_dioread_lock,
> >>>> Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> >>>> Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> >>>> - Opt_prefetch_block_bitmaps,
> >>>> + Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
> >>>> #ifdef CONFIG_EXT4_DEBUG
> >>>> Opt_fc_debug_max_replay, Opt_fc_debug_force
> >>>> #endif
> >>>> @@ -1788,6 +1788,7 @@ static const match_table_t tokens = {
> >>>> {Opt_nombcache, "nombcache"},
> >>>> {Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
> >>>> {Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
> >>>> + {Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
> >>>> {Opt_removed, "check=none"}, /* mount option from ext2/3 */
> >>>> {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
> >>>> {Opt_removed, "reservation"}, /* mount option from ext2/3 */
> >>>> @@ -1820,6 +1821,8 @@ static ext4_fsblk_t get_sb_block(void **data)
> >>>> }
> >>>>
> >>>> #define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
> >>>> +#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
> >>>> +
> >>>> static const char deprecated_msg[] =
> >>>> "Mount option \"%s\" will be removed by %s\n"
> >>>> "Contact [email protected] if you think we should keep it.\n";
> >>>> @@ -2008,6 +2011,7 @@ static const struct mount_opts {
> >>>> {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
> >>>> {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
> >>>> MOPT_SET},
> >>>> + {Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
> >>>> #ifdef CONFIG_EXT4_DEBUG
> >>>> {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
> >>>> MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
> >>>> @@ -2092,6 +2096,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
> >>>> struct ext4_parsed_options {
> >>>> unsigned long journal_devnum;
> >>>> unsigned int journal_ioprio;
> >>>> + int mb_optimize_scan;
> >>>> };
> >>>>
> >>>> static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> >>>> @@ -2388,6 +2393,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> >>>> sbi->s_mount_opt |= m->mount_opt;
> >>>> } else if (token == Opt_data_err_ignore) {
> >>>> sbi->s_mount_opt &= ~m->mount_opt;
> >>>> + } else if (token == Opt_mb_optimize_scan) {
> >>>> + if (arg != 0 && arg != 1) {
> >>>> + ext4_msg(sb, KERN_WARNING,
> >>>> + "mb_optimize_scan should be set to 0 or 1.");
> >>>> + return -1;
> >>>> + }
> >>>> + parsed_opts->mb_optimize_scan = arg;
> >>>> } else {
> >>>> if (!args->from)
> >>>> arg = 1;
> >>>> @@ -4034,6 +4046,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> >>>> /* Set defaults for the variables that will be set during parsing */
> >>>> parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> >>>> parsed_opts.journal_devnum = 0;
> >>>> + parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
> >>>>
> >>>> if ((data && !orig_data) || !sbi)
> >>>> goto out_free_base;
> >>>> @@ -4984,6 +4997,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> >>>> ext4_fc_replay_cleanup(sb);
> >>>>
> >>>> ext4_ext_init(sb);
> >>>> +
> >>>> + /*
> >>>> + * Enable optimize_scan if number of groups is > threshold. This can be
> >>>> + * turned off by passing "mb_optimize_scan=0". This can also be
> >>>> + * turned on forcefully by passing "mb_optimize_scan=1".
> >>>> + */
> >>>> + if (parsed_opts.mb_optimize_scan == 1)
> >>>> + set_opt2(sb, MB_OPTIMIZE_SCAN);
> >>>> + else if (parsed_opts.mb_optimize_scan == 0)
> >>>> + clear_opt2(sb, MB_OPTIMIZE_SCAN);
> >>>> + else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
> >>>> + set_opt2(sb, MB_OPTIMIZE_SCAN);
> >>>> +
> >>>> err = ext4_mb_init(sb);
> >>>> if (err) {
> >>>> ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
> >>>> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> >>>> index 59ca9d73b42f..16b8a838f631 100644
> >>>> --- a/fs/ext4/sysfs.c
> >>>> +++ b/fs/ext4/sysfs.c
> >>>> @@ -213,6 +213,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> >>>> EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
> >>>> EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> >>>> EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
> >>>> +EXT4_RW_ATTR_SBI_UI(mb_linear_limit, s_mb_linear_limit);
> >>>> EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
> >>>> EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
> >>>> EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
> >>>> @@ -260,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
> >>>> ATTR_LIST(mb_stream_req),
> >>>> ATTR_LIST(mb_group_prealloc),
> >>>> ATTR_LIST(mb_max_inode_prealloc),
> >>>> + ATTR_LIST(mb_linear_limit),
> >>>> ATTR_LIST(max_writeback_mb_bump),
> >>>> ATTR_LIST(extent_max_zeroout_kb),
> >>>> ATTR_LIST(trigger_fs_error),
> >>>> --
> >>>> 2.31.0.291.g576ba9dcdaf-goog
> >>>>
> >>>
> >>>
> >>> Cheers, Andreas
> >>>
> >>>
> >>>
> >>>
> >>>
On Mar 30, 2021, at 09:20, harshad shirwadkar <[email protected]> wrote:
>
> I ran some experiments this morning and here are the results:
>
> 1) Config 1:
> - add change to skip CR0 / CR1 optimizations and fall back to
> linear search when all groups are not loaded
> - prefetch block bitmaps mount option turned off
> - first_write_latency: 6m 56s
> - number_of_groups_loaded_at_the_end_of_first_write: 30726 out of
> 491520 total
>
> 2) Config 2:
> - add change to skip CR0 / CR1 optimizations and fall back to
> linear search when all groups are not loaded
> - prefetch block bitmaps mount option turned on
> - first_write_latency:4m42s
> - number_of_groups_loaded_at_the_end_of_first_write: 490497 out
> of 491520 total
>
> 3) Config 3:
> - don't fallback to linear search
> - prefetch block bitmaps turned off
> - first_write_latency: 0.13s
> - number_of_groups_loaded_at_the_end_of_first_write: 7 out of 491520 total
>
> 4) Config 4:
> - don't fallback to linear search
> - prefetch block bitmaps turned on
> - first_write_latency: 0.15s
> - number_of_groups_loaded_at_the_end_of_first_write: 48384 out of
> 491520 total
>
> Based on the above numbers, I think we should go for config 4, which
> means we should have prefetch_block_bitmaps turned on when
> mb_optimize_scan is on. What do you think about turning
> prefetch_block_bitmaps mount option on by default?
I think that makes sense.
>
> - Harshad
>
>> On Mon, Mar 29, 2021 at 9:10 PM Andreas Dilger <[email protected]> wrote:
>>
>>> On Mar 29, 2021, at 17:03, harshad shirwadkar <[email protected]> wrote:
>>>
>>> Another problem that I found in the patch was that when all groups are
>>> not loaded, the optimization concludes that CR0 and CR1 cannot be used
>>> for allocation (since these structures aren't populated). The problem
>>> goes away if we also mount with prefetch_block_bitmaps mount option.
>>> But I think if all the groups are not loaded, we should at least fall
>>> back to linear search at CR1. Maybe skipping CR0 is okay.
>>
>> Alex's patch to skip cr0/1 if no groups are loaded is OK, as long as there is
>> some group prefetch is triggered in the background, IMHO, and is there
>> for later allocations.
>>
>> That avoids lots of scanning at mount time if the filesystem is mostly full,
>> and the first allocation is not critical as long as this triggers group reads.
>>
>> Do you have any idea what the performance impact of this is (eg.
>> latency to first write)? Ideally there would be some group prefetch done
>> right at mount time (itable zero thread?) so that if writes are not
>> done *right* after mount there would not be any noticeable latency.
>>
>> Cheers, Andreas
>>
>>>> On Mon, Mar 29, 2021 at 12:32 PM harshad shirwadkar
>>>> <[email protected]> wrote:
>>>>
>>>> Okay sounds good. I'll do that!
>>>>
>>>>> On Mon, Mar 29, 2021 at 12:14 PM Andreas Dilger <[email protected]> wrote:
>>>>>
>>>>> On Mar 24, 2021, at 5:19 PM, Harshad Shirwadkar <[email protected]> wrote:
>>>>>>
>>>>>> Instead of traversing through groups linearly, scan groups in specific
>>>>>> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
>>>>>> largest free order >= the order of the request. So, with this patch,
>>>>>> we maintain lists for each possible order and insert each group into a
>>>>>> list based on the largest free order in its buddy bitmap. During cr 0
>>>>>> allocation, we traverse these lists in the increasing order of largest
>>>>>> free orders. This allows us to find a group with the best available cr
>>>>>> 0 match in constant time. If nothing can be found, we fallback to cr 1
>>>>>> immediately.
>>>>>>
>>>>>> At CR1, the story is slightly different. We want to traverse in the
>>>>>> order of increasing average fragment size. For CR1, we maintain a rb
>>>>>> tree of groupinfos which is sorted by average fragment size. Instead
>>>>>> of traversing linearly, at CR1, we traverse in the order of increasing
>>>>>> average fragment size, starting at the most optimal group. This brings
>>>>>> down cr 1 search complexity to log(num groups).
>>>>>>
>>>>>> For cr >= 2, we just perform the linear search as before. Also, in
>>>>>> case of lock contention, we intermittently fallback to linear search
>>>>>> even in CR 0 and CR 1 cases. This allows us to proceed during the
>>>>>> allocation path even in case of high contention.
>>>>>
>>>>> If you are refreshing this patch anyway, could you rename "mb_linear_limit"
>>>>> to "mb_linear_groups" or "mb_max_linear_groups" or similar? It otherwise
>>>>> isn't clear what units the "limit" is in (could be MB/GB or something else).
>>>>>
>>>>> Cheers, Andreas
>>>>>
>>>>>> There is an opportunity to do optimization at CR2 too. That's because
>>>>>> at CR2 we only consider groups where bb_free counter (number of free
>>>>>> blocks) is greater than the request extent size. That's left as future
>>>>>> work.
>>>>>>
>>>>>> All the changes introduced in this patch are protected under a new
>>>>>> mount option "mb_optimize_scan".
>>>>>>
>>>>>> With this patchset, following experiment was performed:
>>>>>>
>>>>>> Created a highly fragmented disk of size 65TB. The disk had no
>>>>>> contiguous 2M regions. Following command was run consecutively for 3
>>>>>> times:
>>>>>>
>>>>>> time dd if=/dev/urandom of=file bs=2M count=10
>>>>>>
>>>>>> Here are the results with and without cr 0/1 optimizations introduced
>>>>>> in this patch:
>>>>>>
>>>>>> |---------+------------------------------+---------------------------|
>>>>>> | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
>>>>>> |---------+------------------------------+---------------------------|
>>>>>> | 1st run | 5m1.871s | 2m47.642s |
>>>>>> | 2nd run | 2m28.390s | 0m0.611s |
>>>>>> | 3rd run | 2m26.530s | 0m1.255s |
>>>>>> |---------+------------------------------+---------------------------|
>>>>>>
>>>>>> Signed-off-by: Harshad Shirwadkar <[email protected]>
>>>>>> Reported-by: kernel test robot <[email protected]>
>>>>>> Reported-by: Dan Carpenter <[email protected]>
>>>>>> Reviewed-by: Andreas Dilger <[email protected]>
>>>>>> ---
>>>>>> fs/ext4/ext4.h | 19 ++-
>>>>>> fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
>>>>>> fs/ext4/mballoc.h | 17 ++-
>>>>>> fs/ext4/super.c | 28 +++-
>>>>>> fs/ext4/sysfs.c | 2 +
>>>>>> 5 files changed, 432 insertions(+), 15 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>> index 85eeeba3bca3..5930c8cb5159 100644
>>>>>> --- a/fs/ext4/ext4.h
>>>>>> +++ b/fs/ext4/ext4.h
>>>>>> @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
>>>>>> #define EXT4_MB_USE_RESERVED 0x2000
>>>>>> /* Do strict check for free blocks while retrying block allocation */
>>>>>> #define EXT4_MB_STRICT_CHECK 0x4000
>>>>>> -
>>>>>> +/* Large fragment size list lookup succeeded at least once for cr = 0 */
>>>>>> +#define EXT4_MB_CR0_OPTIMIZED 0x8000
>>>>>> +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
>>>>>> +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
>>>>>> struct ext4_allocation_request {
>>>>>> /* target inode for block we're allocating */
>>>>>> struct inode *inode;
>>>>>> @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
>>>>>> #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
>>>>>> #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
>>>>>> #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
>>>>>> -
>>>>>> +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
>>>>>> + * scanning in mballoc
>>>>>> + */
>>>>>>
>>>>>> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
>>>>>> ~EXT4_MOUNT_##opt
>>>>>> @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
>>>>>> unsigned int s_mb_free_pending;
>>>>>> struct list_head s_freed_data_list; /* List of blocks to be freed
>>>>>> after commit completed */
>>>>>> + struct rb_root s_mb_avg_fragment_size_root;
>>>>>> + rwlock_t s_mb_rb_lock;
>>>>>> + struct list_head *s_mb_largest_free_orders;
>>>>>> + rwlock_t *s_mb_largest_free_orders_locks;
>>>>>>
>>>>>> /* tunables */
>>>>>> unsigned long s_stripe;
>>>>>> + unsigned int s_mb_linear_limit;
>>>>>> unsigned int s_mb_stream_request;
>>>>>> unsigned int s_mb_max_to_scan;
>>>>>> unsigned int s_mb_min_to_scan;
>>>>>> @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
>>>>>> atomic_t s_bal_goals; /* goal hits */
>>>>>> atomic_t s_bal_breaks; /* too long searches */
>>>>>> atomic_t s_bal_2orders; /* 2^order hits */
>>>>>> + atomic_t s_bal_cr0_bad_suggestions;
>>>>>> + atomic_t s_bal_cr1_bad_suggestions;
>>>>>> atomic64_t s_bal_cX_groups_considered[4];
>>>>>> atomic64_t s_bal_cX_hits[4];
>>>>>> atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
>>>>>> @@ -3309,11 +3321,14 @@ struct ext4_group_info {
>>>>>> ext4_grpblk_t bb_free; /* total free blocks */
>>>>>> ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
>>>>>> ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
>>>>>> + ext4_group_t bb_group; /* Group number */
>>>>>> struct list_head bb_prealloc_list;
>>>>>> #ifdef DOUBLE_CHECK
>>>>>> void *bb_bitmap;
>>>>>> #endif
>>>>>> struct rw_semaphore alloc_sem;
>>>>>> + struct rb_node bb_avg_fragment_size_rb;
>>>>>> + struct list_head bb_largest_free_order_node;
>>>>>> ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
>>>>>> * regions, index is order.
>>>>>> * bb_counters[3] = 5 means
>>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>>> index 15127d815461..cbf9a89c0ef5 100644
>>>>>> --- a/fs/ext4/mballoc.c
>>>>>> +++ b/fs/ext4/mballoc.c
>>>>>> @@ -127,11 +127,50 @@
>>>>>> * smallest multiple of the stripe value (sbi->s_stripe) which is
>>>>>> * greater than the default mb_group_prealloc.
>>>>>> *
>>>>>> + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
>>>>>> + * structures in two data structures:
>>>>>> + *
>>>>>> + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
>>>>>> + *
>>>>>> + * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
>>>>>> + *
>>>>>> + * This is an array of lists where the index in the array represents the
>>>>>> + * largest free order in the buddy bitmap of the participating group infos of
>>>>>> + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
>>>>>> + * number of buddy bitmap orders possible) number of lists. Group-infos are
>>>>>> + * placed in appropriate lists.
>>>>>> + *
>>>>>> + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
>>>>>> + *
>>>>>> + * Locking: sbi->s_mb_rb_lock (rwlock)
>>>>>> + *
>>>>>> + * This is a red black tree consisting of group infos and the tree is sorted
>>>>>> + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
>>>>>> + * / ext4_group_info->bb_fragments).
>>>>>> + *
>>>>>> + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
>>>>>> + * structures to decide the order in which groups are to be traversed for
>>>>>> + * fulfilling an allocation request.
>>>>>> + *
>>>>>> + * At CR = 0, we look for groups which have the largest_free_order >= the order
>>>>>> + * of the request. We directly look at the largest free order list in the data
>>>>>> + * structure (1) above where largest_free_order = order of the request. If that
>>>>>> + * list is empty, we look at remaining list in the increasing order of
>>>>>> + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
>>>>>> + *
>>>>>> + * At CR = 1, we only consider groups where average fragment size > request
>>>>>> + * size. So, we lookup a group which has average fragment size just above or
>>>>>> + * equal to request size using our rb tree (data structure 2) in O(log N) time.
>>>>>> + *
>>>>>> + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
>>>>>> + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
>>>>>> + *
>>>>>> * The regular allocator (using the buddy cache) supports a few tunables.
>>>>>> *
>>>>>> * /sys/fs/ext4/<partition>/mb_min_to_scan
>>>>>> * /sys/fs/ext4/<partition>/mb_max_to_scan
>>>>>> * /sys/fs/ext4/<partition>/mb_order2_req
>>>>>> + * /sys/fs/ext4/<partition>/mb_linear_limit
>>>>>> *
>>>>>> * The regular allocator uses buddy scan only if the request len is power of
>>>>>> * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
>>>>>> @@ -149,6 +188,16 @@
>>>>>> * can be used for allocation. ext4_mb_good_group explains how the groups are
>>>>>> * checked.
>>>>>> *
>>>>>> + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
>>>>>> + * get traversed linearly. That may result in subsequent allocations being not
>>>>>> + * close to each other. And so, the underlying device may get filled up in a
>>>>>> + * non-linear fashion. While that may not matter on non-rotational devices, for
>>>>>> + * rotational devices that may result in higher seek times. "mb_linear_limit"
>>>>>> + * tells mballoc how many groups mballoc should search linearly before
>>>>>> + * performing consulting above data structures for more efficient lookups. For
>>>>>> + * non rotational devices, this value defaults to 0 and for rotational devices
>>>>>> + * this is set to MB_DEFAULT_LINEAR_LIMIT.
>>>>>> + *
>>>>>> * Both the prealloc space are getting populated as above. So for the first
>>>>>> * request we will hit the buddy cache which will result in this prealloc
>>>>>> * space getting filled. The prealloc space is then later used for the
>>>>>> @@ -299,6 +348,8 @@
>>>>>> * - bitlock on a group (group)
>>>>>> * - object (inode/locality) (object)
>>>>>> * - per-pa lock (pa)
>>>>>> + * - cr0 lists lock (cr0)
>>>>>> + * - cr1 tree lock (cr1)
>>>>>> *
>>>>>> * Paths:
>>>>>> * - new pa
>>>>>> @@ -328,6 +379,9 @@
>>>>>> * group
>>>>>> * object
>>>>>> *
>>>>>> + * - allocation path (ext4_mb_regular_allocator)
>>>>>> + * group
>>>>>> + * cr0/cr1
>>>>>> */
>>>>>> static struct kmem_cache *ext4_pspace_cachep;
>>>>>> static struct kmem_cache *ext4_ac_cachep;
>>>>>> @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
>>>>>> ext4_group_t group);
>>>>>> static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
>>>>>>
>>>>>> +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
>>>>>> + ext4_group_t group, int cr);
>>>>>> +
>>>>>> /*
>>>>>> * The algorithm using this percpu seq counter goes below:
>>>>>> * 1. We sample the percpu discard_pa_seq counter before trying for block
>>>>>> @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
>>>>>> + int (*cmp)(struct rb_node *, struct rb_node *))
>>>>>> +{
>>>>>> + struct rb_node **iter = &root->rb_node, *parent = NULL;
>>>>>> +
>>>>>> + while (*iter) {
>>>>>> + parent = *iter;
>>>>>> + if (cmp(new, *iter) > 0)
>>>>>> + iter = &((*iter)->rb_left);
>>>>>> + else
>>>>>> + iter = &((*iter)->rb_right);
>>>>>> + }
>>>>>> +
>>>>>> + rb_link_node(new, parent, iter);
>>>>>> + rb_insert_color(new, root);
>>>>>> +}
>>>>>> +
>>>>>> +static int
>>>>>> +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
>>>>>> +{
>>>>>> + struct ext4_group_info *grp1 = rb_entry(rb1,
>>>>>> + struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + struct ext4_group_info *grp2 = rb_entry(rb2,
>>>>>> + struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + int num_frags_1, num_frags_2;
>>>>>> +
>>>>>> + num_frags_1 = grp1->bb_fragments ?
>>>>>> + grp1->bb_free / grp1->bb_fragments : 0;
>>>>>> + num_frags_2 = grp2->bb_fragments ?
>>>>>> + grp2->bb_free / grp2->bb_fragments : 0;
>>>>>> +
>>>>>> + return (num_frags_2 - num_frags_1);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Reinsert grpinfo into the avg_fragment_size tree with new average
>>>>>> + * fragment size.
>>>>>> + */
>>>>>> +static void
>>>>>> +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
>>>>>> +{
>>>>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>>>> +
>>>>>> + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
>>>>>> + return;
>>>>>> +
>>>>>> + write_lock(&sbi->s_mb_rb_lock);
>>>>>> + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
>>>>>> + rb_erase(&grp->bb_avg_fragment_size_rb,
>>>>>> + &sbi->s_mb_avg_fragment_size_root);
>>>>>> + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
>>>>>> + }
>>>>>> +
>>>>>> + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
>>>>>> + &grp->bb_avg_fragment_size_rb,
>>>>>> + ext4_mb_avg_fragment_size_cmp);
>>>>>> + write_unlock(&sbi->s_mb_rb_lock);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Choose next group by traversing largest_free_order lists. Return 0 if next
>>>>>> + * group was selected optimally. Return 1 if next group was not selected
>>>>>> + * optimally. Updates *new_cr if cr level needs an update.
>>>>>> + */
>>>>>> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
>>>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>>>> +{
>>>>>> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>>>>>> + struct ext4_group_info *iter, *grp;
>>>>>> + int i;
>>>>>> +
>>>>>> + if (ac->ac_status == AC_STATUS_FOUND)
>>>>>> + return 1;
>>>>>> +
>>>>>> + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
>>>>>> + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
>>>>>> +
>>>>>> + grp = NULL;
>>>>>> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
>>>>>> + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
>>>>>> + continue;
>>>>>> + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>>>> + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
>>>>>> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>>>> + continue;
>>>>>> + }
>>>>>> + grp = NULL;
>>>>>> + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
>>>>>> + bb_largest_free_order_node) {
>>>>>> + if (sbi->s_mb_stats)
>>>>>> + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
>>>>>> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
>>>>>> + grp = iter;
>>>>>> + break;
>>>>>> + }
>>>>>> + }
>>>>>> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>>>> + if (grp)
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + if (!grp) {
>>>>>> + /* Increment cr and search again */
>>>>>> + *new_cr = 1;
>>>>>> + } else {
>>>>>> + *group = grp->bb_group;
>>>>>> + ac->ac_last_optimal_group = *group;
>>>>>> + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
>>>>>> + }
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Choose next group by traversing average fragment size tree. Return 0 if next
>>>>>> + * group was selected optimally. Return 1 if next group could not selected
>>>>>> + * optimally (due to lock contention). Updates *new_cr if cr lvel needs an
>>>>>> + * update.
>>>>>> + */
>>>>>> +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>>>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>>>> +{
>>>>>> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>>>>>> + int avg_fragment_size, best_so_far;
>>>>>> + struct rb_node *node, *found;
>>>>>> + struct ext4_group_info *grp;
>>>>>> +
>>>>>> + /*
>>>>>> + * If there is contention on the lock, instead of waiting for the lock
>>>>>> + * to become available, just continue searching lineraly. We'll resume
>>>>>> + * our rb tree search later starting at ac->ac_last_optimal_group.
>>>>>> + */
>>>>>> + if (!read_trylock(&sbi->s_mb_rb_lock))
>>>>>> + return 1;
>>>>>> +
>>>>>> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
>>>>>> + if (sbi->s_mb_stats)
>>>>>> + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
>>>>>> + /* We have found something at CR 1 in the past */
>>>>>> + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
>>>>>> + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
>>>>>> + found = rb_next(found)) {
>>>>>> + grp = rb_entry(found, struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + if (sbi->s_mb_stats)
>>>>>> + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
>>>>>> + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + goto done;
>>>>>> + }
>>>>>> +
>>>>>> + node = sbi->s_mb_avg_fragment_size_root.rb_node;
>>>>>> + best_so_far = 0;
>>>>>> + found = NULL;
>>>>>> +
>>>>>> + while (node) {
>>>>>> + grp = rb_entry(node, struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + avg_fragment_size = 0;
>>>>>> + /*
>>>>>> + * Perform this check without locking, we'll lock later to confirm.
>>>>>> + */
>>>>>> + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
>>>>>> + avg_fragment_size = grp->bb_fragments ?
>>>>>> + grp->bb_free / grp->bb_fragments : 0;
>>>>>> + if (!best_so_far || avg_fragment_size < best_so_far) {
>>>>>> + best_so_far = avg_fragment_size;
>>>>>> + found = node;
>>>>>> + }
>>>>>> + }
>>>>>> + if (avg_fragment_size > ac->ac_g_ex.fe_len)
>>>>>> + node = node->rb_right;
>>>>>> + else
>>>>>> + node = node->rb_left;
>>>>>> + }
>>>>>> +
>>>>>> +done:
>>>>>> + if (found) {
>>>>>> + grp = rb_entry(found, struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + *group = grp->bb_group;
>>>>>> + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
>>>>>> + } else {
>>>>>> + *new_cr = 2;
>>>>>> + }
>>>>>> +
>>>>>> + read_unlock(&sbi->s_mb_rb_lock);
>>>>>> + ac->ac_last_optimal_group = *group;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * ext4_mb_choose_next_group: choose next group for allocation.
>>>>>> + *
>>>>>> + * @ac Allocation Context
>>>>>> + * @new_cr This is an output parameter. If the there is no good group
>>>>>> + * available at current CR level, this field is updated to indicate
>>>>>> + * the new cr level that should be used.
>>>>>> + * @group This is an input / output parameter. As an input it indicates the
>>>>>> + * last group used for allocation. As output, this field indicates
>>>>>> + * the next group that should be used.
>>>>>> + * @ngroups Total number of groups
>>>>>> + */
>>>>>> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>>>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>>>> +{
>>>>>> + int ret;
>>>>>> +
>>>>>> + *new_cr = ac->ac_criteria;
>>>>>> +
>>>>>> + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
>>>>>> + *new_cr >= 2 ||
>>>>>> + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
>>>>>> + goto inc_and_return;
>>>>>> +
>>>>>> + if (ac->ac_groups_linear_remaining) {
>>>>>> + ac->ac_groups_linear_remaining--;
>>>>>> + goto inc_and_return;
>>>>>> + }
>>>>>> +
>>>>>> + if (*new_cr == 0) {
>>>>>> + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
>>>>>> + if (ret)
>>>>>> + goto inc_and_return;
>>>>>> + }
>>>>>> + if (*new_cr == 1) {
>>>>>> + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
>>>>>> + if (ret)
>>>>>> + goto inc_and_return;
>>>>>> + }
>>>>>> + return;
>>>>>> +
>>>>>> +inc_and_return:
>>>>>> + /*
>>>>>> + * Artificially restricted ngroups for non-extent
>>>>>> + * files makes group > ngroups possible on first loop.
>>>>>> + */
>>>>>> + *group = *group + 1;
>>>>>> + if (*group >= ngroups)
>>>>>> + *group = 0;
>>>>>> +}
>>>>>> +
>>>>>> /*
>>>>>> * Cache the order of the largest free extent we have available in this block
>>>>>> * group.
>>>>>> @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>>>>>> static void
>>>>>> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>>>>>> {
>>>>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>>>> int i;
>>>>>> - int bits;
>>>>>>
>>>>>> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
>>>>>> + write_lock(&sbi->s_mb_largest_free_orders_locks[
>>>>>> + grp->bb_largest_free_order]);
>>>>>> + list_del_init(&grp->bb_largest_free_order_node);
>>>>>> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
>>>>>> + grp->bb_largest_free_order]);
>>>>>> + }
>>>>>> grp->bb_largest_free_order = -1; /* uninit */
>>>>>>
>>>>>> - bits = MB_NUM_ORDERS(sb) - 1;
>>>>>> - for (i = bits; i >= 0; i--) {
>>>>>> + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
>>>>>> if (grp->bb_counters[i] > 0) {
>>>>>> grp->bb_largest_free_order = i;
>>>>>> break;
>>>>>> }
>>>>>> }
>>>>>> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
>>>>>> + grp->bb_largest_free_order >= 0 && grp->bb_free) {
>>>>>> + write_lock(&sbi->s_mb_largest_free_orders_locks[
>>>>>> + grp->bb_largest_free_order]);
>>>>>> + list_add_tail(&grp->bb_largest_free_order_node,
>>>>>> + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
>>>>>> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
>>>>>> + grp->bb_largest_free_order]);
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> static noinline_for_stack
>>>>>> @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>>>>>> period = get_cycles() - period;
>>>>>> atomic_inc(&sbi->s_mb_buddies_generated);
>>>>>> atomic64_add(period, &sbi->s_mb_generation_time);
>>>>>> + mb_update_avg_fragment_size(sb, grp);
>>>>>> }
>>>>>>
>>>>>> /* The buddy information is attached the buddy cache inode
>>>>>> @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
>>>>>>
>>>>>> done:
>>>>>> mb_set_largest_free_order(sb, e4b->bd_info);
>>>>>> + mb_update_avg_fragment_size(sb, e4b->bd_info);
>>>>>> mb_check_buddy(e4b);
>>>>>> }
>>>>>>
>>>>>> @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>>>>>> }
>>>>>> mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
>>>>>>
>>>>>> + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
>>>>>> ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
>>>>>> mb_check_buddy(e4b);
>>>>>>
>>>>>> @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>>> * from the goal value specified
>>>>>> */
>>>>>> group = ac->ac_g_ex.fe_group;
>>>>>> + ac->ac_last_optimal_group = group;
>>>>>> + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
>>>>>> prefetch_grp = group;
>>>>>>
>>>>>> - for (i = 0; i < ngroups; group++, i++) {
>>>>>> - int ret = 0;
>>>>>> + for (i = 0; i < ngroups; i++) {
>>>>>> + int ret = 0, new_cr;
>>>>>> +
>>>>>> cond_resched();
>>>>>> - /*
>>>>>> - * Artificially restricted ngroups for non-extent
>>>>>> - * files makes group > ngroups possible on first loop.
>>>>>> - */
>>>>>> - if (group >= ngroups)
>>>>>> - group = 0;
>>>>>> +
>>>>>> + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
>>>>>> +
>>>>>> + if (new_cr != cr) {
>>>>>> + cr = new_cr;
>>>>>> + goto repeat;
>>>>>> + }
>>>>>>
>>>>>> /*
>>>>>> * Batch reads of the block allocation bitmaps
>>>>>> @@ -2578,6 +2902,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
>>>>>> atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
>>>>>> seq_printf(seq, "\t\tuseless_loops: %llu\n",
>>>>>> atomic64_read(&sbi->s_bal_cX_failed[0]));
>>>>>> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
>>>>>> + atomic_read(&sbi->s_bal_cr0_bad_suggestions));
>>>>>>
>>>>>> seq_puts(seq, "\tcr1_stats:\n");
>>>>>> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
>>>>>> @@ -2585,6 +2911,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
>>>>>> atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
>>>>>> seq_printf(seq, "\t\tuseless_loops: %llu\n",
>>>>>> atomic64_read(&sbi->s_bal_cX_failed[1]));
>>>>>> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
>>>>>> + atomic_read(&sbi->s_bal_cr1_bad_suggestions));
>>>>>>
>>>>>> seq_puts(seq, "\tcr2_stats:\n");
>>>>>> seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
>>>>>> @@ -2719,7 +3047,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
>>>>>> INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
>>>>>> init_rwsem(&meta_group_info[i]->alloc_sem);
>>>>>> meta_group_info[i]->bb_free_root = RB_ROOT;
>>>>>> + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
>>>>>> + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
>>>>>> meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
>>>>>> + meta_group_info[i]->bb_group = group;
>>>>>>
>>>>>> mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
>>>>>> return 0;
>>>>>> @@ -2909,6 +3240,26 @@ int ext4_mb_init(struct super_block *sb)
>>>>>> i++;
>>>>>> } while (i < MB_NUM_ORDERS(sb));
>>>>>>
>>>>>> + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
>>>>>> + sbi->s_mb_largest_free_orders =
>>>>>> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
>>>>>> + GFP_KERNEL);
>>>>>> + if (!sbi->s_mb_largest_free_orders) {
>>>>>> + ret = -ENOMEM;
>>>>>> + goto out;
>>>>>> + }
>>>>>> + sbi->s_mb_largest_free_orders_locks =
>>>>>> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
>>>>>> + GFP_KERNEL);
>>>>>> + if (!sbi->s_mb_largest_free_orders_locks) {
>>>>>> + ret = -ENOMEM;
>>>>>> + goto out;
>>>>>> + }
>>>>>> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
>>>>>> + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
>>>>>> + rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
>>>>>> + }
>>>>>> + rwlock_init(&sbi->s_mb_rb_lock);
>>>>>>
>>>>>> spin_lock_init(&sbi->s_md_lock);
>>>>>> sbi->s_mb_free_pending = 0;
>>>>>> @@ -2961,6 +3312,10 @@ int ext4_mb_init(struct super_block *sb)
>>>>>> spin_lock_init(&lg->lg_prealloc_lock);
>>>>>> }
>>>>>>
>>>>>> + if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
>>>>>> + sbi->s_mb_linear_limit = 0;
>>>>>> + else
>>>>>> + sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
>>>>>> /* init file for buddy data */
>>>>>> ret = ext4_mb_init_backend(sb);
>>>>>> if (ret != 0)
>>>>>> @@ -2972,6 +3327,8 @@ int ext4_mb_init(struct super_block *sb)
>>>>>> free_percpu(sbi->s_locality_groups);
>>>>>> sbi->s_locality_groups = NULL;
>>>>>> out:
>>>>>> + kfree(sbi->s_mb_largest_free_orders);
>>>>>> + kfree(sbi->s_mb_largest_free_orders_locks);
>>>>>> kfree(sbi->s_mb_offsets);
>>>>>> sbi->s_mb_offsets = NULL;
>>>>>> kfree(sbi->s_mb_maxs);
>>>>>> @@ -3028,6 +3385,8 @@ int ext4_mb_release(struct super_block *sb)
>>>>>> kvfree(group_info);
>>>>>> rcu_read_unlock();
>>>>>> }
>>>>>> + kfree(sbi->s_mb_largest_free_orders);
>>>>>> + kfree(sbi->s_mb_largest_free_orders_locks);
>>>>>> kfree(sbi->s_mb_offsets);
>>>>>> kfree(sbi->s_mb_maxs);
>>>>>> iput(sbi->s_buddy_cache);
>>>>>> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
>>>>>> index 68111a10cfee..02585e3cbcad 100644
>>>>>> --- a/fs/ext4/mballoc.h
>>>>>> +++ b/fs/ext4/mballoc.h
>>>>>> @@ -78,6 +78,18 @@
>>>>>> */
>>>>>> #define MB_DEFAULT_MAX_INODE_PREALLOC 512
>>>>>>
>>>>>> +/*
>>>>>> + * Number of groups to search linearly before performing group scanning
>>>>>> + * optimization.
>>>>>> + */
>>>>>> +#define MB_DEFAULT_LINEAR_LIMIT 4
>>>>>> +
>>>>>> +/*
>>>>>> + * Minimum number of groups that should be present in the file system to perform
>>>>>> + * group scanning optimizations.
>>>>>> + */
>>>>>> +#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
>>>>>> +
>>>>>> /*
>>>>>> * Number of valid buddy orders
>>>>>> */
>>>>>> @@ -166,11 +178,14 @@ struct ext4_allocation_context {
>>>>>> /* copy of the best found extent taken before preallocation efforts */
>>>>>> struct ext4_free_extent ac_f_ex;
>>>>>>
>>>>>> + ext4_group_t ac_last_optimal_group;
>>>>>> + __u32 ac_groups_considered;
>>>>>> + __u32 ac_flags; /* allocation hints */
>>>>>> __u16 ac_groups_scanned;
>>>>>> + __u16 ac_groups_linear_remaining;
>>>>>> __u16 ac_found;
>>>>>> __u16 ac_tail;
>>>>>> __u16 ac_buddy;
>>>>>> - __u16 ac_flags; /* allocation hints */
>>>>>> __u8 ac_status;
>>>>>> __u8 ac_criteria;
>>>>>> __u8 ac_2order; /* if request is to allocate 2^N blocks and
>>>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>>>> index 3a2cd9fb7e73..6116640081c0 100644
>>>>>> --- a/fs/ext4/super.c
>>>>>> +++ b/fs/ext4/super.c
>>>>>> @@ -1687,7 +1687,7 @@ enum {
>>>>>> Opt_dioread_nolock, Opt_dioread_lock,
>>>>>> Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
>>>>>> Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
>>>>>> - Opt_prefetch_block_bitmaps,
>>>>>> + Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
>>>>>> #ifdef CONFIG_EXT4_DEBUG
>>>>>> Opt_fc_debug_max_replay, Opt_fc_debug_force
>>>>>> #endif
>>>>>> @@ -1788,6 +1788,7 @@ static const match_table_t tokens = {
>>>>>> {Opt_nombcache, "nombcache"},
>>>>>> {Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
>>>>>> {Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
>>>>>> + {Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
>>>>>> {Opt_removed, "check=none"}, /* mount option from ext2/3 */
>>>>>> {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
>>>>>> {Opt_removed, "reservation"}, /* mount option from ext2/3 */
>>>>>> @@ -1820,6 +1821,8 @@ static ext4_fsblk_t get_sb_block(void **data)
>>>>>> }
>>>>>>
>>>>>> #define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
>>>>>> +#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
>>>>>> +
>>>>>> static const char deprecated_msg[] =
>>>>>> "Mount option \"%s\" will be removed by %s\n"
>>>>>> "Contact [email protected] if you think we should keep it.\n";
>>>>>> @@ -2008,6 +2011,7 @@ static const struct mount_opts {
>>>>>> {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
>>>>>> {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
>>>>>> MOPT_SET},
>>>>>> + {Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
>>>>>> #ifdef CONFIG_EXT4_DEBUG
>>>>>> {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
>>>>>> MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
>>>>>> @@ -2092,6 +2096,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
>>>>>> struct ext4_parsed_options {
>>>>>> unsigned long journal_devnum;
>>>>>> unsigned int journal_ioprio;
>>>>>> + int mb_optimize_scan;
>>>>>> };
>>>>>>
>>>>>> static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>>>>>> @@ -2388,6 +2393,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>>>>>> sbi->s_mount_opt |= m->mount_opt;
>>>>>> } else if (token == Opt_data_err_ignore) {
>>>>>> sbi->s_mount_opt &= ~m->mount_opt;
>>>>>> + } else if (token == Opt_mb_optimize_scan) {
>>>>>> + if (arg != 0 && arg != 1) {
>>>>>> + ext4_msg(sb, KERN_WARNING,
>>>>>> + "mb_optimize_scan should be set to 0 or 1.");
>>>>>> + return -1;
>>>>>> + }
>>>>>> + parsed_opts->mb_optimize_scan = arg;
>>>>>> } else {
>>>>>> if (!args->from)
>>>>>> arg = 1;
>>>>>> @@ -4034,6 +4046,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>>>> /* Set defaults for the variables that will be set during parsing */
>>>>>> parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
>>>>>> parsed_opts.journal_devnum = 0;
>>>>>> + parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
>>>>>>
>>>>>> if ((data && !orig_data) || !sbi)
>>>>>> goto out_free_base;
>>>>>> @@ -4984,6 +4997,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>>>> ext4_fc_replay_cleanup(sb);
>>>>>>
>>>>>> ext4_ext_init(sb);
>>>>>> +
>>>>>> + /*
>>>>>> + * Enable optimize_scan if number of groups is > threshold. This can be
>>>>>> + * turned off by passing "mb_optimize_scan=0". This can also be
>>>>>> + * turned on forcefully by passing "mb_optimize_scan=1".
>>>>>> + */
>>>>>> + if (parsed_opts.mb_optimize_scan == 1)
>>>>>> + set_opt2(sb, MB_OPTIMIZE_SCAN);
>>>>>> + else if (parsed_opts.mb_optimize_scan == 0)
>>>>>> + clear_opt2(sb, MB_OPTIMIZE_SCAN);
>>>>>> + else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
>>>>>> + set_opt2(sb, MB_OPTIMIZE_SCAN);
>>>>>> +
>>>>>> err = ext4_mb_init(sb);
>>>>>> if (err) {
>>>>>> ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
>>>>>> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
>>>>>> index 59ca9d73b42f..16b8a838f631 100644
>>>>>> --- a/fs/ext4/sysfs.c
>>>>>> +++ b/fs/ext4/sysfs.c
>>>>>> @@ -213,6 +213,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
>>>>>> EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
>>>>>> EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
>>>>>> EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
>>>>>> +EXT4_RW_ATTR_SBI_UI(mb_linear_limit, s_mb_linear_limit);
>>>>>> EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
>>>>>> EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
>>>>>> EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
>>>>>> @@ -260,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
>>>>>> ATTR_LIST(mb_stream_req),
>>>>>> ATTR_LIST(mb_group_prealloc),
>>>>>> ATTR_LIST(mb_max_inode_prealloc),
>>>>>> + ATTR_LIST(mb_linear_limit),
>>>>>> ATTR_LIST(max_writeback_mb_bump),
>>>>>> ATTR_LIST(extent_max_zeroout_kb),
>>>>>> ATTR_LIST(trigger_fs_error),
>>>>>> --
>>>>>> 2.31.0.291.g576ba9dcdaf-goog
>>>>>>
>>>>>
>>>>>
>>>>> Cheers, Andreas
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>