2021-01-29 22:31:29

by harshad shirwadkar

Subject: [PATCH 0/4] Improve group scanning in CR 0 and CR 1 passes

This patch series improves the cr 0 and cr 1 passes of the allocator
significantly. Currently, at cr 0 and 1, we perform linear lookups to
find matching groups. That's very inefficient for large file systems
with millions of block groups. At cr 0, we only care about groups
whose largest free order is >= the request's order, and at cr 1 we
only care about groups whose average fragment size is > the request
size. So, this patch series introduces new data structures that allow
us to perform the cr 0 lookup in constant time and the cr 1 lookup in
log(number of groups) time instead of linear time.

For cr 0, we add a list for each order, and each group is enqueued on
the list corresponding to the largest free order in its buddy bitmap.
This allows us to look up a match at cr 0 in constant time.
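
A condensed sketch of that cr 0 lookup (the helper name is illustrative
only; in the series itself this logic lives in ext4_mb_choose_next_group()
in patch 3/4):

static struct ext4_group_info *
mb_find_cr0_candidate(struct ext4_sb_info *sbi, int order, int num_orders)
{
        int i;

        /*
         * Walk the per-order lists upward from the request's order and
         * take the first group on the first non-empty list.
         */
        for (i = order; i < num_orders; i++) {
                if (list_empty(&sbi->s_mb_largest_free_orders[i]))
                        continue;
                return list_first_entry(&sbi->s_mb_largest_free_orders[i],
                                        struct ext4_group_info,
                                        bb_largest_free_order_node);
        }

        return NULL;    /* nothing suitable at cr 0, fall back to cr 1 */
}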

For cr 1, we add a new rb tree of groups sorted by average fragment
size. This allows us to look up a match for cr 1 in log(num groups)
time.

Verified that there are no regressions in smoke tests (-g quick -c 4k).

Also, to demonstrate the effectiveness of the patch series, the
following experiment was performed:

Created a highly fragmented disk of size 65TB, with no contiguous 2M
regions. The following command was run 3 times in a row:

time dd if=/dev/urandom of=file bs=2M count=10

Here are the results with and without cr 0/1 optimizations:

|---------+------------------------------+---------------------------|
|         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s                     | 2m47.642s                 |
| 2nd run | 2m28.390s                    | 0m0.611s                  |
| 3rd run | 2m26.530s                    | 0m1.255s                  |
|---------+------------------------------+---------------------------|

Signed-off-by: Harshad Shirwadkar <[email protected]>

Harshad Shirwadkar (4):
ext4: add MB_NUM_ORDERS macro
ext4: drop s_mb_bal_lock and convert protected fields to atomic
ext4: improve cr 0 / cr 1 group scanning
ext4: add proc files to monitor new structures

fs/ext4/ext4.h | 12 +-
fs/ext4/mballoc.c | 330 ++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/mballoc.h | 6 +
fs/ext4/sysfs.c | 2 +
4 files changed, 324 insertions(+), 26 deletions(-)

--
2.30.0.365.g02bc693789-goog


2021-01-29 22:31:33

by harshad shirwadkar

Subject: [PATCH 3/4] ext4: improve cr 0 / cr 1 group scanning

Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain all the ext4_group_info structs in lists for each possible
order. During cr 0 allocation, we traverse these lists in the
increasing order of largest free orders. This allows us to find a
group with the best available cr 0 match in constant time. If nothing
can be found, we fallback to cr 1 immediately.

At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).

For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/ext4.h | 6 ++
fs/ext4/mballoc.c | 223 ++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/mballoc.h | 1 +
3 files changed, 222 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6dd127942208..da12a083bf52 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1527,6 +1527,9 @@ struct ext4_sb_info {
unsigned int s_mb_free_pending;
struct list_head s_freed_data_list; /* List of blocks to be freed
after commit completed */
+ struct rb_root s_mb_avg_fragment_size_root;
+ struct list_head *s_mb_largest_free_orders;
+ rwlock_t s_mb_rb_lock;

/* tunables */
unsigned long s_stripe;
@@ -3304,11 +3307,14 @@ struct ext4_group_info {
ext4_grpblk_t bb_free; /* total free blocks */
ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
+ ext4_group_t bb_group; /* Group number */
struct list_head bb_prealloc_list;
#ifdef DOUBLE_CHECK
void *bb_bitmap;
#endif
struct rw_semaphore alloc_sem;
+ struct rb_node bb_avg_fragment_size_rb;
+ struct list_head bb_largest_free_order_node;
ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
* regions, index is order.
* bb_counters[3] = 5 means
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 11c56b0e6f35..413259477b03 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -744,6 +744,193 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
}
}

+static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
+ int (*cmp)(struct rb_node *, struct rb_node *))
+{
+ struct rb_node **iter = &root->rb_node, *parent = NULL;
+
+ while (*iter) {
+ parent = *iter;
+ if (cmp(new, *iter))
+ iter = &((*iter)->rb_left);
+ else
+ iter = &((*iter)->rb_right);
+ }
+
+ rb_link_node(new, parent, iter);
+ rb_insert_color(new, root);
+}
+
+static int
+ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
+{
+ struct ext4_group_info *grp1 = rb_entry(rb1,
+ struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ struct ext4_group_info *grp2 = rb_entry(rb2,
+ struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ int num_frags_1, num_frags_2;
+
+ num_frags_1 = grp1->bb_fragments ?
+ grp1->bb_free / grp1->bb_fragments : 0;
+ num_frags_2 = grp2->bb_fragments ?
+ grp2->bb_free / grp2->bb_fragments : 0;
+
+ return (num_frags_1 < num_frags_2);
+}
+
+/*
+ * Reinsert grpinfo into the avg_fragment_size tree and into the appropriate
+ * largest_free_order list.
+ */
+static void
+ext4_mb_reinsert_grpinfo(struct super_block *sb, struct ext4_group_info *grp)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ write_lock(&sbi->s_mb_rb_lock);
+ if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
+ rb_erase(&grp->bb_avg_fragment_size_rb,
+ &sbi->s_mb_avg_fragment_size_root);
+ RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
+ }
+
+ ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
+ &grp->bb_avg_fragment_size_rb,
+ ext4_mb_avg_fragment_size_cmp);
+
+ list_del_init(&grp->bb_largest_free_order_node);
+ if (grp->bb_largest_free_order >= 0)
+ list_add(&grp->bb_largest_free_order_node,
+ &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
+ write_unlock(&sbi->s_mb_rb_lock);
+}
+
+/*
+ * ext4_mb_choose_next_group: choose next group for allocation.
+ *
+ * @ac Allocation Context
+ * @new_cr This is an output parameter. If there is no good group available
+ * at current CR level, this field is updated to indicate the new cr
+ * level that should be used.
+ * @group This is an input / output parameter. As an input it indicates the last
+ * group used for allocation. As output, this field indicates the
+ * next group that should be used.
+ * @ngroups Total number of groups
+ */
+static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
+ int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ int avg_fragment_size, best_so_far, i;
+ struct rb_node *node, *found;
+ struct ext4_group_info *grp;
+
+ *new_cr = ac->ac_criteria;
+ if (*new_cr >= 2 ||
+ !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
+ goto inc_and_return;
+
+ /*
+ * If there is contention on the lock, instead of waiting for the lock
+ * to become available, just continue searching linearly.
+ */
+ if (!read_trylock(&sbi->s_mb_rb_lock))
+ goto inc_and_return;
+
+ if (*new_cr == 0) {
+ grp = NULL;
+
+ if (ac->ac_status == AC_STATUS_FOUND)
+ goto inc_and_return;
+
+ for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+ if (list_empty(&sbi->s_mb_largest_free_orders[i]))
+ continue;
+ grp = list_first_entry(&sbi->s_mb_largest_free_orders[i],
+ struct ext4_group_info,
+ bb_largest_free_order_node);
+ break;
+ }
+
+ if (grp) {
+ *group = grp->bb_group;
+ goto done;
+ }
+ /* Increment cr and search again */
+ *new_cr = 1;
+ }
+
+ /*
+ * At CR 1, if enough groups are not loaded, we just fallback to
+ * linear search
+ */
+ if (atomic_read(&sbi->s_mb_buddies_generated) <
+ ext4_get_groups_count(ac->ac_sb)) {
+ read_unlock(&sbi->s_mb_rb_lock);
+ goto inc_and_return;
+ }
+
+ if (*new_cr == 1) {
+ if (ac->ac_f_ex.fe_len > 0) {
+ /* We have found something at CR 1 in the past */
+ grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
+ found = rb_next(&grp->bb_avg_fragment_size_rb);
+ if (found) {
+ grp = rb_entry(found, struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ *group = grp->bb_group;
+ } else {
+ *new_cr = 2;
+ }
+ goto done;
+ }
+
+ /* This is the first time we are searching in the tree */
+ node = sbi->s_mb_avg_fragment_size_root.rb_node;
+ best_so_far = 0;
+ found = NULL;
+
+ while (node) {
+ grp = rb_entry(node, struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ avg_fragment_size = grp->bb_fragments ?
+ grp->bb_free / grp->bb_fragments : 0;
+ if (avg_fragment_size > ac->ac_g_ex.fe_len) {
+ if (!best_so_far || avg_fragment_size < best_so_far) {
+ best_so_far = avg_fragment_size;
+ found = node;
+ }
+ }
+ if (avg_fragment_size > ac->ac_g_ex.fe_len)
+ node = node->rb_right;
+ else
+ node = node->rb_left;
+ }
+ if (found) {
+ grp = rb_entry(found, struct ext4_group_info,
+ bb_avg_fragment_size_rb);
+ *group = grp->bb_group;
+ } else {
+ *new_cr = 2;
+ }
+ }
+done:
+ read_unlock(&sbi->s_mb_rb_lock);
+ ac->ac_last_optimal_group = *group;
+ return;
+
+inc_and_return:
+ /*
+ * Artificially restricted ngroups for non-extent
+ * files makes group > ngroups possible on first loop.
+ */
+ *group = *group + 1;
+ if (*group >= ngroups)
+ *group = 0;
+}
+
/*
* Cache the order of the largest free extent we have available in this block
* group.
@@ -818,6 +1005,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
period = get_cycles() - period;
atomic_inc(&sbi->s_mb_buddies_generated);
atomic64_add(period, &sbi->s_mb_generation_time);
+ ext4_mb_reinsert_grpinfo(sb, grp);
}

/* The buddy information is attached the buddy cache inode
@@ -1517,6 +1705,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,

done:
mb_set_largest_free_order(sb, e4b->bd_info);
+ ext4_mb_reinsert_grpinfo(sb, e4b->bd_info);
mb_check_buddy(e4b);
}

@@ -1653,6 +1842,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
}
mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);

+ ext4_mb_reinsert_grpinfo(e4b->bd_sb, e4b->bd_info);
ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
mb_check_buddy(e4b);

@@ -2345,17 +2535,20 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
* from the goal value specified
*/
group = ac->ac_g_ex.fe_group;
+ ac->ac_last_optimal_group = group;
prefetch_grp = group;

- for (i = 0; i < ngroups; group++, i++) {
- int ret = 0;
+ for (i = 0; i < ngroups; i++) {
+ int ret = 0, new_cr;
+
cond_resched();
- /*
- * Artificially restricted ngroups for non-extent
- * files makes group > ngroups possible on first loop.
- */
- if (group >= ngroups)
- group = 0;
+
+ ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
+
+ if (new_cr != cr) {
+ cr = new_cr;
+ goto repeat;
+ }

/*
* Batch reads of the block allocation bitmaps
@@ -2650,7 +2843,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
init_rwsem(&meta_group_info[i]->alloc_sem);
meta_group_info[i]->bb_free_root = RB_ROOT;
+ INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
+ RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
+ meta_group_info[i]->bb_group = group;

mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
return 0;
@@ -2840,6 +3036,15 @@ int ext4_mb_init(struct super_block *sb)
i++;
} while (i < MB_NUM_ORDERS(sb));

+ sbi->s_mb_avg_fragment_size_root = RB_ROOT;
+ sbi->s_mb_largest_free_orders =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+ GFP_KERNEL);
+ if (!sbi->s_mb_largest_free_orders)
+ goto out;
+ for (i = 0; i < MB_NUM_ORDERS(sb); i++)
+ INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
+ rwlock_init(&sbi->s_mb_rb_lock);

spin_lock_init(&sbi->s_md_lock);
sbi->s_mb_free_pending = 0;
@@ -2903,6 +3108,7 @@ int ext4_mb_init(struct super_block *sb)
free_percpu(sbi->s_locality_groups);
sbi->s_locality_groups = NULL;
out:
+ kfree(sbi->s_mb_largest_free_orders);
kfree(sbi->s_mb_offsets);
sbi->s_mb_offsets = NULL;
kfree(sbi->s_mb_maxs);
@@ -2959,6 +3165,7 @@ int ext4_mb_release(struct super_block *sb)
kvfree(group_info);
rcu_read_unlock();
}
+ kfree(sbi->s_mb_largest_free_orders);
kfree(sbi->s_mb_offsets);
kfree(sbi->s_mb_maxs);
iput(sbi->s_buddy_cache);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 68111a10cfee..57b44c7320b2 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -166,6 +166,7 @@ struct ext4_allocation_context {
/* copy of the best found extent taken before preallocation efforts */
struct ext4_free_extent ac_f_ex;

+ ext4_group_t ac_last_optimal_group;
__u16 ac_groups_scanned;
__u16 ac_found;
__u16 ac_tail;
--
2.30.0.365.g02bc693789-goog

2021-01-29 22:31:40

by harshad shirwadkar

Subject: [PATCH 4/4] ext4: add proc files to monitor new structures

This patch adds a new file "mb_structs" which allows us to see the
largest free order lists and the serialized average fragment tree.

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/mballoc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/sysfs.c | 2 ++
3 files changed, 82 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index da12a083bf52..835e304e3113 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2809,6 +2809,7 @@ int __init ext4_fc_init_dentry_cache(void);

/* mballoc.c */
extern const struct seq_operations ext4_mb_seq_groups_ops;
+extern const struct seq_operations ext4_mb_seq_structs_ops;
extern long ext4_mb_stats;
extern long ext4_mb_max_to_scan;
extern int ext4_mb_init(struct super_block *);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 413259477b03..ec4656903bd4 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2741,6 +2741,85 @@ const struct seq_operations ext4_mb_seq_groups_ops = {
.show = ext4_mb_seq_groups_show,
};

+static void *ext4_mb_seq_structs_start(struct seq_file *seq, loff_t *pos)
+{
+ struct super_block *sb = PDE_DATA(file_inode(seq->file));
+ unsigned long position;
+
+ read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
+
+ if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + ext4_get_groups_count(sb))
+ return NULL;
+ position = *pos + 1;
+ return (void *) ((unsigned long) position);
+}
+
+static void *ext4_mb_seq_structs_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct super_block *sb = PDE_DATA(file_inode(seq->file));
+ unsigned long position;
+
+ ++*pos;
+ if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + ext4_get_groups_count(sb))
+ return NULL;
+ position = *pos + 1;
+ return (void *) ((unsigned long) position);
+}
+
+static int ext4_mb_seq_structs_show(struct seq_file *seq, void *v)
+{
+ struct super_block *sb = PDE_DATA(file_inode(seq->file));
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ unsigned long position = ((unsigned long) v);
+ struct ext4_group_info *grpinfo;
+ struct rb_node *n;
+ int i;
+
+ position--;
+
+ if (position >= MB_NUM_ORDERS(sb)) {
+ position -= MB_NUM_ORDERS(sb);
+ if (position == 0)
+ seq_puts(seq, "Group, Avg Fragment Size\n");
+ n = rb_first(&sbi->s_mb_avg_fragment_size_root);
+ for (i = 0; n && i < position; i++)
+ n = rb_next(n);
+ if (!n)
+ return 0;
+ grpinfo = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
+ seq_printf(seq, "%d, %d\n",
+ grpinfo->bb_group,
+ grpinfo->bb_fragments ? grpinfo->bb_free / grpinfo->bb_fragments : 0);
+ return 0;
+ }
+
+ if (position == 0)
+ seq_puts(seq, "Largest Free Order Lists:\n");
+
+ seq_printf(seq, "[%ld]: ", position);
+ list_for_each_entry(grpinfo, &sbi->s_mb_largest_free_orders[position],
+ bb_largest_free_order_node) {
+ seq_printf(seq, "%d ", grpinfo->bb_group);
+ }
+ seq_puts(seq, "\n");
+
+ return 0;
+}
+
+static void ext4_mb_seq_structs_stop(struct seq_file *seq, void *v)
+{
+ struct super_block *sb = PDE_DATA(file_inode(seq->file));
+
+ read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
+}
+
+const struct seq_operations ext4_mb_seq_structs_ops = {
+ .start = ext4_mb_seq_structs_start,
+ .next = ext4_mb_seq_structs_next,
+ .stop = ext4_mb_seq_structs_stop,
+ .show = ext4_mb_seq_structs_show,
+};
+
static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
{
int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 4e27fe6ed3ae..828d58b98310 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -527,6 +527,8 @@ int ext4_register_sysfs(struct super_block *sb)
ext4_fc_info_show, sb);
proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
&ext4_mb_seq_groups_ops, sb);
+ proc_create_seq_data("mb_structs", 0444, sbi->s_proc,
+ &ext4_mb_seq_structs_ops, sb);
}
return 0;
}
--
2.30.0.365.g02bc693789-goog

2021-01-29 22:33:47

by harshad shirwadkar

Subject: [PATCH 2/4] ext4: drop s_mb_bal_lock and convert protected fields to atomic

s_mb_buddies_generated gets used later in this patch series to
determine if the cr 0 and cr 1 optimizations should be performed or
not. Currently, s_mb_buddies_generated is protected by a spinlock. In
the allocation path, it is better if we don't depend on the lock and
instead read the value atomically. In order to do that, we drop
s_bal_lock altogether and convert the only two fields it protects,
s_mb_buddies_generated and s_mb_generation_time, to atomic types.

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/ext4.h | 5 ++---
fs/ext4/mballoc.c | 13 +++++--------
2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 64f25ea2fa7a..6dd127942208 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1552,9 +1552,8 @@ struct ext4_sb_info {
atomic_t s_bal_goals; /* goal hits */
atomic_t s_bal_breaks; /* too long searches */
atomic_t s_bal_2orders; /* 2^order hits */
- spinlock_t s_bal_lock;
- unsigned long s_mb_buddies_generated;
- unsigned long long s_mb_generation_time;
+ atomic_t s_mb_buddies_generated; /* number of buddies generated */
+ atomic64_t s_mb_generation_time;
atomic_t s_mb_lost_chunks;
atomic_t s_mb_preallocated;
atomic_t s_mb_discarded;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 625242e5c683..11c56b0e6f35 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -816,10 +816,8 @@ void ext4_mb_generate_buddy(struct super_block *sb,
clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));

period = get_cycles() - period;
- spin_lock(&sbi->s_bal_lock);
- sbi->s_mb_buddies_generated++;
- sbi->s_mb_generation_time += period;
- spin_unlock(&sbi->s_bal_lock);
+ atomic_inc(&sbi->s_mb_buddies_generated);
+ atomic64_add(period, &sbi->s_mb_generation_time);
}

/* The buddy information is attached the buddy cache inode
@@ -2844,7 +2842,6 @@ int ext4_mb_init(struct super_block *sb)


spin_lock_init(&sbi->s_md_lock);
- spin_lock_init(&sbi->s_bal_lock);
sbi->s_mb_free_pending = 0;
INIT_LIST_HEAD(&sbi->s_freed_data_list);

@@ -2980,9 +2977,9 @@ int ext4_mb_release(struct super_block *sb)
atomic_read(&sbi->s_bal_breaks),
atomic_read(&sbi->s_mb_lost_chunks));
ext4_msg(sb, KERN_INFO,
- "mballoc: %lu generated and it took %Lu",
- sbi->s_mb_buddies_generated,
- sbi->s_mb_generation_time);
+ "mballoc: %u generated and it took %llu",
+ atomic_read(&sbi->s_mb_buddies_generated),
+ atomic64_read(&sbi->s_mb_generation_time));
ext4_msg(sb, KERN_INFO,
"mballoc: %u preallocated, %u discarded",
atomic_read(&sbi->s_mb_preallocated),
--
2.30.0.365.g02bc693789-goog

2021-01-29 22:33:47

by harshad shirwadkar

Subject: [PATCH 1/4] ext4: add MB_NUM_ORDERS macro

A few arrays in mballoc.c use the total number of valid orders as
their size. Currently, this value is set as "sb->s_blocksize_bits +
2". This makes code harder to read. So, instead add a new macro
MB_NUM_ORDERS(sb) to make the code more readable.

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/mballoc.c | 15 ++++++++-------
fs/ext4/mballoc.h | 5 +++++
2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 99bf091fee10..625242e5c683 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -756,7 +756,7 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)

grp->bb_largest_free_order = -1; /* uninit */

- bits = sb->s_blocksize_bits + 1;
+ bits = MB_NUM_ORDERS(sb) - 1;
for (i = bits; i >= 0; i--) {
if (grp->bb_counters[i] > 0) {
grp->bb_largest_free_order = i;
@@ -1930,7 +1930,7 @@ void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
int max;

BUG_ON(ac->ac_2order <= 0);
- for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) {
+ for (i = ac->ac_2order; i < MB_NUM_ORDERS(sb); i++) {
if (grp->bb_counters[i] == 0)
continue;

@@ -2315,13 +2315,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
* We also support searching for power-of-two requests only for
* requests upto maximum buddy size we have constructed.
*/
- if (i >= sbi->s_mb_order2_reqs && i <= sb->s_blocksize_bits + 2) {
+ if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
/*
* This should tell if fe_len is exactly power of 2
*/
if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0)
ac->ac_2order = array_index_nospec(i - 1,
- sb->s_blocksize_bits + 2);
+ MB_NUM_ORDERS(sb));
}

/* if stream allocation is enabled, use global goal */
@@ -2806,7 +2806,7 @@ int ext4_mb_init(struct super_block *sb)
unsigned max;
int ret;

- i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_offsets);
+ i = MB_NUM_ORDERS(sb) * sizeof(*sbi->s_mb_offsets);

sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL);
if (sbi->s_mb_offsets == NULL) {
@@ -2814,7 +2814,7 @@ int ext4_mb_init(struct super_block *sb)
goto out;
}

- i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_maxs);
+ i = MB_NUM_ORDERS(sb) * sizeof(*sbi->s_mb_maxs);
sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL);
if (sbi->s_mb_maxs == NULL) {
ret = -ENOMEM;
@@ -2840,7 +2840,8 @@ int ext4_mb_init(struct super_block *sb)
offset_incr = offset_incr >> 1;
max = max >> 1;
i++;
- } while (i <= sb->s_blocksize_bits + 1);
+ } while (i < MB_NUM_ORDERS(sb));
+

spin_lock_init(&sbi->s_md_lock);
spin_lock_init(&sbi->s_bal_lock);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index e75b4749aa1c..68111a10cfee 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -78,6 +78,11 @@
*/
#define MB_DEFAULT_MAX_INODE_PREALLOC 512

+/*
+ * Number of valid buddy orders
+ */
+#define MB_NUM_ORDERS(sb) ((sb)->s_blocksize_bits + 2)
+
struct ext4_free_data {
/* this links the free block information from sb_info */
struct list_head efd_list;
--
2.30.0.365.g02bc693789-goog

2021-01-30 09:22:53

by Andreas Dilger

Subject: Re: [PATCH 1/4] ext4: add MB_NUM_ORDERS macro

On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
> A few arrays in mballoc.c use the total number of valid orders as
> their size. Currently, this value is set as "sb->s_blocksize_bits +
> 2". This makes code harder to read. So, instead add a new macro
> MB_NUM_ORDERS(sb) to make the code more readable.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

There were a few cases that _looked_ incorrect, because MB_NUM_ORDERS(sb)
was replacing "sb->s_blocksize_bits + 1", but they also changed "<=" to "<"
so they appear to be correct...

Reviewed-by: Andreas Dilger <[email protected]>

Cheers, Andreas

2021-01-30 09:23:08

by Andreas Dilger

Subject: Re: [PATCH 0/4] Improve group scanning in CR 0 and CR 1 passes

On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
> This patch series improves the cr 0 and cr 1 passes of the allocator
> significantly. Currently, at cr 0 and 1, we perform linear lookups to
> find matching groups. That's very inefficient for large file systems
> with millions of block groups. At cr 0, we only care about groups
> whose largest free order is >= the request's order, and at cr 1 we
> only care about groups whose average fragment size is > the request
> size. So, this patch series introduces new data structures that allow
> us to perform the cr 0 lookup in constant time and the cr 1 lookup in
> log(number of groups) time instead of linear time.

I've added Alex and Artem to the CC list, since they've both been working
on this code recently and hopefully they can review these patches. Also,
I've added Shuichi in the hopes that he can give this a test on a very
large filesystem to see what kind of improvement it shows.

Cheers, Andreas


2021-01-30 09:24:37

by Andreas Dilger

Subject: Re: [PATCH 2/4] ext4: drop s_mb_bal_lock and convert protected fields to atomic

On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> s_mb_buddies_generated gets used later in this patch series to
> determine if the cr 0 and cr 1 optimizations should be performed or
> not. Currently, s_mb_buddies_generated is protected by a spinlock. In
> the allocation path, it is better if we don't depend on the lock and
> instead read the value atomically. In order to do that, we drop
> s_bal_lock altogether and convert the only two fields it protects,
> s_mb_buddies_generated and s_mb_generation_time, to atomic types.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

Reviewed-by: Andreas Dilger <[email protected]>

Cheers, Andreas

2021-01-30 10:21:49

by Andreas Dilger

Subject: Re: [PATCH 3/4] ext4: improve cr 0 / cr 1 group scanning

On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> Instead of traversing through groups linearly,

For linear IO writes to a file, before checking any of the lists,
it makes sense to still check the current and next group to see
if they have free extents of the proper size? That would preserve
linearity of writes to a file in the common few-threads-writing case.

That would reduce contention on the list locks (as you mention later)
compared to *every* thread checking the list each time, as well as
improve the sequential allocation for files. The cr0/cr1 list/tree
would only be used for finding a new allocation group if the current
and next groups fail.
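
(A rough sketch of what that could look like, reusing fields this series
already maintains; the helper name and exact call site are hypothetical
and not part of the patch:)

static bool ext4_mb_try_goal_groups(struct ext4_allocation_context *ac,
                                    ext4_group_t *group, ext4_group_t ngroups)
{
        ext4_group_t goal = ac->ac_g_ex.fe_group;
        int i;

        /*
         * Try the goal group and the one after it before consulting the
         * cr 0 lists / cr 1 tree, so sequential writers keep allocating
         * from the group they were already using.
         */
        for (i = 0; i < 2; i++) {
                ext4_group_t g = (goal + i) % ngroups;
                struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, g);

                if (grp && grp->bb_largest_free_order >= ac->ac_2order) {
                        *group = g;
                        return true;
                }
        }

        return false;   /* nothing nearby, consult the cr 0/cr 1 structures */
}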

> scan groups in specific
> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> largest free order >= the order of the request. So, with this patch,
> we maintain all the ext4_group_info structs in lists for each possible
> order.

This would be more clear like (if I'm describing this correctly):

we maintain lists for each possible order and insert each group
into a list based on the largest free order in its buddy bitmap.

> During cr 0 allocation, we traverse these lists in the
> increasing order of largest free orders. This allows us to find a
> group with the best available cr 0 match in constant time. If nothing
> can be found, we fallback to cr 1 immediately.

Just to clarify, if allocating, say, an 8MB extent, this would jump
straight to the 8MB group list to look for groups that have free
extents in their 8MB buddy bitmap, and if that list is empty it will
go to the 16MB list to look for a group, repeat as necessary until an
extent is found (and split) or there are no larger extents and the
cr0 pass is finished?

> At CR1, the story is slightly different. We want to traverse in the
> order of increasing average fragment size. For CR1, we maintain a rb
> tree of groupinfos which is sorted by average fragment size. Instead
> of traversing linearly, at CR1, we traverse in the order of increasing
> average fragment size, starting at the most optimal group. This brings
> down cr 1 search complexity to log(num groups).
>
> For cr >= 2, we just perform the linear search as before. Also, in
> case of lock contention, we intermittently fallback to linear search
> even in CR 0 and CR 1 cases. This allows us to proceed during the
> allocation path even in case of high contention.

Did you look at separate locks for each order of the cr0 list to
reduce contention? It might be possible to opportunistically avoid
locking any of the list heads when *checking* if the list is empty
or not. I don't think there are serious issues with races if the
list is already empty.

Having multiple list locks is fine if there is a specific locking
order (larger to smaller) when multiple locks are taken. If there
is a group with the correct order it only needs a single list lock,
and if a higher order group is needed (requiring the extent to be
split) then the next lower list would be locked to insert the group,
which can be taken without dropping the higher-order list lock.
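
(For illustration, a per-order lock with an opportunistic emptiness check
could look roughly like this; the lock argument and helper name are
hypothetical, not something in the patch:)

static struct ext4_group_info *
mb_peek_order_list(struct list_head *order_list, spinlock_t *order_lock)
{
        struct ext4_group_info *grp = NULL;

        /*
         * Lockless peek: racing with a concurrent insert or delete is
         * harmless here, an empty-looking list just means the caller
         * moves on to the next order.
         */
        if (list_empty(order_list))
                return NULL;

        spin_lock(order_lock);
        if (!list_empty(order_list))
                grp = list_first_entry(order_list, struct ext4_group_info,
                                       bb_largest_free_order_node);
        spin_unlock(order_lock);

        return grp;
}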

When inserting groups back into the list, assuming that each thread
will stick with the same group for some number of allocations (rather
than having to re-lookup the same group each time), does it make more
sense to put the group at the *end* of the list so that the next
thread doing a lookup will find a different group than another thread
was just allocating from?
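
For illustration, the re-insert in ext4_mb_reinsert_grpinfo() would then
use list_add_tail() instead of list_add():

        /* queue at the tail so the next lookup tends to hand out a
         * different group to a different thread */
        list_add_tail(&grp->bb_largest_free_order_node,
                      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);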

Is there any particular requirement why the rb lock and the list
lock(s) need to be the same lock? I'd imagine that they do not,
since a thread will be using either one list or the other. Making
these separate locks and reducing the lock hold time would allow
more threads to be doing useful work instead of spinning.

Cheers, Andreas


2021-01-30 11:00:59

by Andreas Dilger

Subject: Re: [PATCH 4/4] ext4: add proc files to monitor new structures

On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> This patch adds a new file "mb_structs" which allows us to see the
> largest free order lists and the serialized average fragment tree.

It might make sense to split these two structs into separate files.
On large filesystems, there will be millions of groups, so having
to dump both structs to find out information about one or the other
would be a lot of overhead.

Also, while the "mb_groups" file can be large (one line per group)
for millions of groups, the files added here may have millions of
groups on the same line. Text processing tools are used to dealing
with many lines in a file, but are relatively poor at dealing with
millions of characters on a single line...

It would be good to have a "summary" file that dumps general info
about these structs (e.g. the number of groups at each list order,
depth of the rbtree). That could be maintained easily for the cr 0
lists at least, since the list is locked whenever a group is added
and removed.
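
(A hypothetical sketch of such a summary, counting the groups on each
largest-free-order list; the function name and output format are made up
here, only the fields come from this series:)

static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
{
        struct super_block *sb = PDE_DATA(file_inode(seq->file));
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        struct ext4_group_info *grp;
        unsigned long count;
        int order;

        read_lock(&sbi->s_mb_rb_lock);
        for (order = 0; order < MB_NUM_ORDERS(sb); order++) {
                count = 0;
                list_for_each_entry(grp,
                                    &sbi->s_mb_largest_free_orders[order],
                                    bb_largest_free_order_node)
                        count++;
                seq_printf(seq, "largest_free_order %d: %lu groups\n",
                           order, count);
        }
        read_unlock(&sbi->s_mb_rb_lock);

        return 0;
}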

In Artem's patch to improve mballoc for large filesystems (which
didn't land, but was useful anyway), there was an "mb_alloc_stats"
file added:

https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch#L102

/proc/fs/ldiskfs/dm-2/mb_alloc_stats:mballoc:
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: blocks: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: reqs: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: success: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: extents_scanned: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: goal_hits: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: 2^n_hits: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: breaks: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: lost: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c1_loops: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c2_loops: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c3_loops: 163
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c1_loops: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c2_loops: 1230
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c3_loops: 0
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: buddies_generated: 6305
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: buddies_time_used: 20165523
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: preallocated: 9943702
/proc/fs/ldiskfs/dm-2/mb_alloc_stats: discarded: 10506

that would still be useful to maintain, since only a few of the lines
are specific to his patch ({useless,skipped}_c[123]_loops)

Cheers, Andreas

2021-01-30 22:13:22

by harshad shirwadkar

Subject: Re: [PATCH 3/4] ext4: improve cr 0 / cr 1 group scanning

On Sat, Jan 30, 2021 at 2:19 AM Andreas Dilger <[email protected]> wrote:
>
> On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > Instead of traversing through groups linearly,
>
> For linear IO writes to a file, before checking any of the lists,
> it makes sense to still check the current and next group to see
> if they have free extents of the proper size? That would preserve
> linearity of writes to a file in the common few-threads-writing case.
>
> That would reduce contention on the list locks (as you mention later)
> compared to *every* thread checking the list each time, as well as
> improve the sequential allocation for files. The cr0/cr1 list/tree
> would only be used for finding a new allocation group if the current
> and next groups fail.
Yes, it would make sense to do that. But I think that logic should
reside outside of this code. And perhaps be combined with
ext4_mb_find_by_goal()? Keeping that logic out of this would keep this
code simple.
>
> > scan groups in specific
> > orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> > largest free order >= the order of the request. So, with this patch,
> > we maintain all the ext4_group_info structs in lists for each possible
> > order.
>
> This would be more clear like (if I'm describing this correctly):
>
> we maintain lists for each possible order and insert each group
> into a list based on the largest free order in its buddy bitmap.
Yes, that's much better, I'll incorporate this in the commit message
in the next version.
>
> > During cr 0 allocation, we traverse these lists in the
> > increasing order of largest free orders. This allows us to find a
> > group with the best available cr 0 match in constant time. If nothing
> > can be found, we fallback to cr 1 immediately.
>
> Just to clarify, if allocating, say, an 8MB extent, this would jump
> straight to the 8MB group list to look for groups that have free
> extents in their 8MB buddy bitmap, and if that list is empty it will
> go to the 16MB list to look for a group, repeat as necessary until an
> extent is found (and split) or there are no larger extents and the
> cr0 pass is finished?
Yes that's right.
>
> > At CR1, the story is slightly different. We want to traverse in the
> > order of increasing average fragment size. For CR1, we maintain a rb
> > tree of groupinfos which is sorted by average fragment size. Instead
> > of traversing linearly, at CR1, we traverse in the order of increasing
> > average fragment size, starting at the most optimal group. This brings
> > down cr 1 search complexity to log(num groups).
> >
> > For cr >= 2, we just perform the linear search as before. Also, in
> > case of lock contention, we intermittently fallback to linear search
> > even in CR 0 and CR 1 cases. This allows us to proceed during the
> > allocation path even in case of high contention.
>
> Did you look at separate locks for each order of the cr0 list to
> reduce contention? It might be possible to opportunistically avoid
> locking any of the list heads when *checking* if the list is empty
> or not. I don't think there are serious issues with races if the
> list is already empty.
Ah, yes that's a good idea.

In general, I kept only one lock because these structures get accessed
together, but as you point out there's an opportunity to do better. As
you mentioned below, we can have one lock for the tree and perhaps a
lock for each list.
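
Roughly (untested; the s_mb_largest_free_orders_locks[] array of
per-order spinlocks is hypothetical, this patch only has the single
rwlock):

static struct ext4_group_info *
ext4_mb_find_good_group_cr0(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	struct ext4_group_info *grp = NULL;
	int i;

	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
		/* unlocked peek; a stale "empty" just means we skip the list */
		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
			continue;
		spin_lock(&sbi->s_mb_largest_free_orders_locks[i]);
		if (!list_empty(&sbi->s_mb_largest_free_orders[i]))
			grp = list_first_entry(&sbi->s_mb_largest_free_orders[i],
					       struct ext4_group_info,
					       bb_largest_free_order_node);
		spin_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
		if (grp)
			break;
	}
	return grp;
}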
>
> Having multiple list locks is fine if there is a specific locking
> order (larger to smaller) when multiple locks are taken. If there
> is a group with the correct order it only needs a single list lock,
> and if a higher order group is needed (requiring the extent to be
> split) then the next lower list would be locked to insert the group,
> which can be taken without dropping the higher-order list lock.
So, I think we can live without taking multiple list locks, mainly
because these structures (the rb tree and the lists) are more of a
"hint" and thus have some tolerance for inaccuracy. For example, in
the case that you mentioned above, let's say we have 2 threads and
both of them need the same higher order group, and assume that both
will end up splitting an extent. In that case, our function
ext4_mb_choose_next_group() will still point both threads at the same
group. However, only one of them will win (because only one of them
will be able to lock the groupinfo) and will update our structures
accordingly. After that, the thread that lost will be able to find
another suitable group by looking up our structures again.

So, to summarize, I like the idea of having a lock per list, since
that would reduce the contention we would otherwise see on one lock.
Also, these structures are not designed to always be in sync with
ext4_group_info, i.e. it may happen that a group is found on a list
that doesn't match its current largest free order, or that a part of
the rb tree is temporarily not sorted, and that is okay. Such states
are always temporary and eventually get fixed. Allowing for this small
inaccuracy means we don't have to hold contentious locks for long,
which improves performance. It also simplifies the locking.
>
> When inserting groups back into the list, assuming that each thread
> will stick with the same group for some number of allocations (rather
> than having to re-lookup the same group each time), does it make more
> sense to put the group at the *end* of the list so that the next
> thread doing a lookup will find a different group than another thread
> was just allocating from?
That's a good idea.
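In ext4_mb_reinsert_grpinfo() that would simply be a list_add_tail()
instead of the list_add(), i.e.:

	list_del_init(&grp->bb_largest_free_order_node);
	if (grp->bb_largest_free_order >= 0)
		/* tail-insert so the next lookup hands out a different group */
		list_add_tail(&grp->bb_largest_free_order_node,
			      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);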
>
> Is there any particular requirement why the rb lock and the list
> lock(s) need to be the same lock? I'd imagine that they do not,
> since a thread will be using either one list or the other. Making
> these separate locks and reducing the lock hold time would allow
> more threads to be doing useful work instead of spinning.
I agree, I'll do this in the next version.

Thanks,
Harshad
>
> Cheers, Andreas
>
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > ---
> > fs/ext4/ext4.h | 6 ++
> > fs/ext4/mballoc.c | 223 ++++++++++++++++++++++++++++++++++++++++++++--
> > fs/ext4/mballoc.h | 1 +
> > 3 files changed, 222 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 6dd127942208..da12a083bf52 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1527,6 +1527,9 @@ struct ext4_sb_info {
> > unsigned int s_mb_free_pending;
> > struct list_head s_freed_data_list; /* List of blocks to be freed
> > after commit completed */
> > + struct rb_root s_mb_avg_fragment_size_root;
> > + struct list_head *s_mb_largest_free_orders;
> > + rwlock_t s_mb_rb_lock;
> >
> > /* tunables */
> > unsigned long s_stripe;
> > @@ -3304,11 +3307,14 @@ struct ext4_group_info {
> > ext4_grpblk_t bb_free; /* total free blocks */
> > ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> > ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
> > + ext4_group_t bb_group; /* Group number */
> > struct list_head bb_prealloc_list;
> > #ifdef DOUBLE_CHECK
> > void *bb_bitmap;
> > #endif
> > struct rw_semaphore alloc_sem;
> > + struct rb_node bb_avg_fragment_size_rb;
> > + struct list_head bb_largest_free_order_node;
> > ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
> > * regions, index is order.
> > * bb_counters[3] = 5 means
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 11c56b0e6f35..413259477b03 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -744,6 +744,193 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> > }
> > }
> >
> > +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> > + int (*cmp)(struct rb_node *, struct rb_node *))
> > +{
> > + struct rb_node **iter = &root->rb_node, *parent = NULL;
> > +
> > + while (*iter) {
> > + parent = *iter;
> > + if (cmp(new, *iter))
> > + iter = &((*iter)->rb_left);
> > + else
> > + iter = &((*iter)->rb_right);
> > + }
> > +
> > + rb_link_node(new, parent, iter);
> > + rb_insert_color(new, root);
> > +}
> > +
> > +static int
> > +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> > +{
> > + struct ext4_group_info *grp1 = rb_entry(rb1,
> > + struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + struct ext4_group_info *grp2 = rb_entry(rb2,
> > + struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + int num_frags_1, num_frags_2;
> > +
> > + num_frags_1 = grp1->bb_fragments ?
> > + grp1->bb_free / grp1->bb_fragments : 0;
> > + num_frags_2 = grp2->bb_fragments ?
> > + grp2->bb_free / grp2->bb_fragments : 0;
> > +
> > + return (num_frags_1 < num_frags_2);
> > +}
> > +
> > +/*
> > + * Reinsert grpinfo into the avg_fragment_size tree and into the appropriate
> > + * largest_free_order list.
> > + */
> > +static void
> > +ext4_mb_reinsert_grpinfo(struct super_block *sb, struct ext4_group_info *grp)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +
> > + write_lock(&sbi->s_mb_rb_lock);
> > + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> > + rb_erase(&grp->bb_avg_fragment_size_rb,
> > + &sbi->s_mb_avg_fragment_size_root);
> > + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> > + }
> > +
> > + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
> > + &grp->bb_avg_fragment_size_rb,
> > + ext4_mb_avg_fragment_size_cmp);
> > +
> > + list_del_init(&grp->bb_largest_free_order_node);
> > + if (grp->bb_largest_free_order >= 0)
> > + list_add(&grp->bb_largest_free_order_node,
> > + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> > + write_unlock(&sbi->s_mb_rb_lock);
> > +}
> > +
> > +/*
> > + * ext4_mb_choose_next_group: choose next group for allocation.
> > + *
> > + * @ac Allocation Context
> > + * @new_cr This is an output parameter. If there is no good group available
> > + * at the current CR level, this field is updated to indicate the new cr
> > + * level that should be used.
> > + * @group This is an input / output parameter. As an input it indicates the last
> > + * group used for allocation. As output, this field indicates the
> > + * next group that should be used.
> > + * @ngroups Total number of groups
> > + */
> > +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + int avg_fragment_size, best_so_far, i;
> > + struct rb_node *node, *found;
> > + struct ext4_group_info *grp;
> > +
> > + *new_cr = ac->ac_criteria;
> > + if (*new_cr >= 2 ||
> > + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> > + goto inc_and_return;
> > +
> > + /*
> > + * If there is contention on the lock, instead of waiting for the lock
> > + * to become available, just continue searching linearly.
> > + */
> > + if (!read_trylock(&sbi->s_mb_rb_lock))
> > + goto inc_and_return;
> > +
> > + if (*new_cr == 0) {
> > + grp = NULL;
> > +
> > + if (ac->ac_status == AC_STATUS_FOUND)
> > + goto inc_and_return;
> > +
> > + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> > + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> > + continue;
> > + grp = list_first_entry(&sbi->s_mb_largest_free_orders[i],
> > + struct ext4_group_info,
> > + bb_largest_free_order_node);
> > + break;
> > + }
> > +
> > + if (grp) {
> > + *group = grp->bb_group;
> > + goto done;
> > + }
> > + /* Increment cr and search again */
> > + *new_cr = 1;
> > + }
> > +
> > + /*
> > + * At CR 1, if enough groups are not loaded, we just fallback to
> > + * linear search
> > + */
> > + if (atomic_read(&sbi->s_mb_buddies_generated) <
> > + ext4_get_groups_count(ac->ac_sb)) {
> > + read_unlock(&sbi->s_mb_rb_lock);
> > + goto inc_and_return;
> > + }
> > +
> > + if (*new_cr == 1) {
> > + if (ac->ac_f_ex.fe_len > 0) {
> > + /* We have found something at CR 1 in the past */
> > + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
> > + found = rb_next(&grp->bb_avg_fragment_size_rb);
> > + if (found) {
> > + grp = rb_entry(found, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + *group = grp->bb_group;
> > + } else {
> > + *new_cr = 2;
> > + }
> > + goto done;
> > + }
> > +
> > + /* This is the first time we are searching in the tree */
> > + node = sbi->s_mb_avg_fragment_size_root.rb_node;
> > + best_so_far = 0;
> > + found = NULL;
> > +
> > + while (node) {
> > + grp = rb_entry(node, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + avg_fragment_size = grp->bb_fragments ?
> > + grp->bb_free / grp->bb_fragments : 0;
> > + if (avg_fragment_size > ac->ac_g_ex.fe_len) {
> > + if (!best_so_far || avg_fragment_size < best_so_far) {
> > + best_so_far = avg_fragment_size;
> > + found = node;
> > + }
> > + }
> > + if (avg_fragment_size > ac->ac_g_ex.fe_len)
> > + node = node->rb_right;
> > + else
> > + node = node->rb_left;
> > + }
> > + if (found) {
> > + grp = rb_entry(found, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + *group = grp->bb_group;
> > + } else {
> > + *new_cr = 2;
> > + }
> > + }
> > +done:
> > + read_unlock(&sbi->s_mb_rb_lock);
> > + ac->ac_last_optimal_group = *group;
> > + return;
> > +
> > +inc_and_return:
> > + /*
> > + * Artificially restricted ngroups for non-extent
> > + * files makes group > ngroups possible on first loop.
> > + */
> > + *group = *group + 1;
> > + if (*group >= ngroups)
> > + *group = 0;
> > +}
> > +
> > /*
> > * Cache the order of the largest free extent we have available in this block
> > * group.
> > @@ -818,6 +1005,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
> > period = get_cycles() - period;
> > atomic_inc(&sbi->s_mb_buddies_generated);
> > atomic64_add(period, &sbi->s_mb_generation_time);
> > + ext4_mb_reinsert_grpinfo(sb, grp);
> > }
> >
> > /* The buddy information is attached the buddy cache inode
> > @@ -1517,6 +1705,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> >
> > done:
> > mb_set_largest_free_order(sb, e4b->bd_info);
> > + ext4_mb_reinsert_grpinfo(sb, e4b->bd_info);
> > mb_check_buddy(e4b);
> > }
> >
> > @@ -1653,6 +1842,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> > }
> > mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
> >
> > + ext4_mb_reinsert_grpinfo(e4b->bd_sb, e4b->bd_info);
> > ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
> > mb_check_buddy(e4b);
> >
> > @@ -2345,17 +2535,20 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> > * from the goal value specified
> > */
> > group = ac->ac_g_ex.fe_group;
> > + ac->ac_last_optimal_group = group;
> > prefetch_grp = group;
> >
> > - for (i = 0; i < ngroups; group++, i++) {
> > - int ret = 0;
> > + for (i = 0; i < ngroups; i++) {
> > + int ret = 0, new_cr;
> > +
> > cond_resched();
> > - /*
> > - * Artificially restricted ngroups for non-extent
> > - * files makes group > ngroups possible on first loop.
> > - */
> > - if (group >= ngroups)
> > - group = 0;
> > +
> > + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
> > +
> > + if (new_cr != cr) {
> > + cr = new_cr;
> > + goto repeat;
> > + }
> >
> > /*
> > * Batch reads of the block allocation bitmaps
> > @@ -2650,7 +2843,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> > INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> > init_rwsem(&meta_group_info[i]->alloc_sem);
> > meta_group_info[i]->bb_free_root = RB_ROOT;
> > + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> > + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
> > meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
> > + meta_group_info[i]->bb_group = group;
> >
> > mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
> > return 0;
> > @@ -2840,6 +3036,15 @@ int ext4_mb_init(struct super_block *sb)
> > i++;
> > } while (i < MB_NUM_ORDERS(sb));
> >
> > + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
> > + sbi->s_mb_largest_free_orders =
> > + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> > + GFP_KERNEL);
> > + if (!sbi->s_mb_largest_free_orders)
> > + goto out;
> > + for (i = 0; i < MB_NUM_ORDERS(sb); i++)
> > + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> > + rwlock_init(&sbi->s_mb_rb_lock);
> >
> > spin_lock_init(&sbi->s_md_lock);
> > sbi->s_mb_free_pending = 0;
> > @@ -2903,6 +3108,7 @@ int ext4_mb_init(struct super_block *sb)
> > free_percpu(sbi->s_locality_groups);
> > sbi->s_locality_groups = NULL;
> > out:
> > + kfree(sbi->s_mb_largest_free_orders);
> > kfree(sbi->s_mb_offsets);
> > sbi->s_mb_offsets = NULL;
> > kfree(sbi->s_mb_maxs);
> > @@ -2959,6 +3165,7 @@ int ext4_mb_release(struct super_block *sb)
> > kvfree(group_info);
> > rcu_read_unlock();
> > }
> > + kfree(sbi->s_mb_largest_free_orders);
> > kfree(sbi->s_mb_offsets);
> > kfree(sbi->s_mb_maxs);
> > iput(sbi->s_buddy_cache);
> > diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> > index 68111a10cfee..57b44c7320b2 100644
> > --- a/fs/ext4/mballoc.h
> > +++ b/fs/ext4/mballoc.h
> > @@ -166,6 +166,7 @@ struct ext4_allocation_context {
> > /* copy of the best found extent taken before preallocation efforts */
> > struct ext4_free_extent ac_f_ex;
> >
> > + ext4_group_t ac_last_optimal_group;
> > __u16 ac_groups_scanned;
> > __u16 ac_found;
> > __u16 ac_tail;
> > --
> > 2.30.0.365.g02bc693789-goog
> >
>
>
> Cheers, Andreas
>

2021-02-01 23:39:49

by harshad shirwadkar

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: add proc files to monitor new structures

I see, that makes sense. Also, as you mentioned, it would be good to
have a summary file for these structures. In fact, given what mb_groups
already shows, if we add a summary file I think we can probably drop
these verbose files altogether: mb_groups already reports the largest
free order and the fragment sizes, from which we can reconstruct these
data structures and then verify their correctness against the summary
file.
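
Something like this (untested; "mb_structs_summary" is just a working
name), registered via proc_create_single_data(), would be enough for
the cr 0 side:

static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
{
	struct super_block *sb = PDE_DATA(file_inode(seq->file));
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_group_info *grp;
	unsigned long count;
	int i;

	read_lock(&sbi->s_mb_rb_lock);
	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
		count = 0;
		list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[i],
				    bb_largest_free_order_node)
			count++;
		/* one line per order: how many groups currently sit on it */
		seq_printf(seq, "order %d: %lu groups\n", i, count);
	}
	read_unlock(&sbi->s_mb_rb_lock);
	return 0;
}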

On Sat, Jan 30, 2021 at 2:58 AM Andreas Dilger <[email protected]> wrote:
>
> On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > This patch adds a new file "mb_structs" which allows us to see the
> > largest free order lists and the serialized average fragment tree.
>
> It might make sense to split these two structs into separate files.
> On large filesystems, there will be millions of groups, so having
> to dump both structs to find out information about one or the other
> would be a lot of overhead.
>
> Also, while the "mb_groups" file can be large (one line per group)
> for millions of groups, the files added here may have millions of
> groups on the same line. Text processing tools are used to dealing
> with many lines in a file, but are relatively poor at dealing with
> millions of characters on a single line...
>
> It would be good to have a "summary" file that dumps general info
> about these structs (e.g. the number of groups at each list order,
> depth of the rbtree). That could be maintained easily for the c0
> lists at least, since the list is locked whenever a group is added
> and removed.
>
> In Artem's patch to improve mballoc for large filesystems (which
> didn't land, but was useful anyway), there was an "mb_alloc_stats"
> file added:
>
> https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch#L102
>
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats:mballoc:
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: blocks: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: reqs: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: success: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: extents_scanned: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: goal_hits: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: 2^n_hits: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: breaks: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: lost: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c1_loops: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c2_loops: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c3_loops: 163
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c1_loops: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c2_loops: 1230
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c3_loops: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: buddies_generated: 6305
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: buddies_time_used: 20165523
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: preallocated: 9943702
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: discarded: 10506
>
> that would still be useful to maintain, since only a few of the lines
> are specific to his patch ({useless,skipped}_c[123]_loops)
I agree, this is very useful to have! It gives us a way to benchmark
the efficiency of mballoc. I'll add mb_alloc_stats here.
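
For instance (untested; the s_bal_cr0_hits / s_bal_cr1_hits /
s_bal_linear_fallbacks counters are made-up names that would be bumped
from ext4_mb_choose_next_group()):

static int ext4_mb_seq_alloc_stats_show(struct seq_file *seq, void *v)
{
	struct super_block *sb = PDE_DATA(file_inode(seq->file));
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	/* how often each strategy actually produced the next group */
	seq_printf(seq, "cr0_hits: %llu\n",
		   (unsigned long long)atomic64_read(&sbi->s_bal_cr0_hits));
	seq_printf(seq, "cr1_hits: %llu\n",
		   (unsigned long long)atomic64_read(&sbi->s_bal_cr1_hits));
	seq_printf(seq, "linear_fallbacks: %llu\n",
		   (unsigned long long)atomic64_read(&sbi->s_bal_linear_fallbacks));
	return 0;
}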

Thanks,
Harshad
>
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > ---
> > fs/ext4/ext4.h | 1 +
> > fs/ext4/mballoc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
> > fs/ext4/sysfs.c | 2 ++
> > 3 files changed, 82 insertions(+)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index da12a083bf52..835e304e3113 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -2809,6 +2809,7 @@ int __init ext4_fc_init_dentry_cache(void);
> >
> > /* mballoc.c */
> > extern const struct seq_operations ext4_mb_seq_groups_ops;
> > +extern const struct seq_operations ext4_mb_seq_structs_ops;
> > extern long ext4_mb_stats;
> > extern long ext4_mb_max_to_scan;
> > extern int ext4_mb_init(struct super_block *);
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 413259477b03..ec4656903bd4 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -2741,6 +2741,85 @@ const struct seq_operations ext4_mb_seq_groups_ops = {
> > .show = ext4_mb_seq_groups_show,
> > };
> >
> > +static void *ext4_mb_seq_structs_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct super_block *sb = PDE_DATA(file_inode(seq->file));
> > + unsigned long position;
> > +
> > + read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
> > +
> > + if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + ext4_get_groups_count(sb))
> > + return NULL;
> > + position = *pos + 1;
> > + return (void *) ((unsigned long) position);
> > +}
> > +
> > +static void *ext4_mb_seq_structs_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > + struct super_block *sb = PDE_DATA(file_inode(seq->file));
> > + unsigned long position;
> > +
> > + ++*pos;
> > + if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + ext4_get_groups_count(sb))
> > + return NULL;
> > + position = *pos + 1;
> > + return (void *) ((unsigned long) position);
> > +}
> > +
> > +static int ext4_mb_seq_structs_show(struct seq_file *seq, void *v)
> > +{
> > + struct super_block *sb = PDE_DATA(file_inode(seq->file));
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > + unsigned long position = ((unsigned long) v);
> > + struct ext4_group_info *grpinfo;
> > + struct rb_node *n;
> > + int i;
> > +
> > + position--;
> > +
> > + if (position >= MB_NUM_ORDERS(sb)) {
> > + position -= MB_NUM_ORDERS(sb);
> > + if (position == 0)
> > + seq_puts(seq, "Group, Avg Fragment Size\n");
> > + n = rb_first(&sbi->s_mb_avg_fragment_size_root);
> > + for (i = 0; n && i < position; i++)
> > + n = rb_next(n);
> > + if (!n)
> > + return 0;
> > + grpinfo = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
> > + seq_printf(seq, "%d, %d\n",
> > + grpinfo->bb_group,
> > + grpinfo->bb_fragments ? grpinfo->bb_free / grpinfo->bb_fragments : 0);
> > + return 0;
> > + }
> > +
> > + if (position == 0)
> > + seq_puts(seq, "Largest Free Order Lists:\n");
> > +
> > + seq_printf(seq, "[%ld]: ", position);
> > + list_for_each_entry(grpinfo, &sbi->s_mb_largest_free_orders[position],
> > + bb_largest_free_order_node) {
> > + seq_printf(seq, "%d ", grpinfo->bb_group);
> > + }
> > + seq_puts(seq, "\n");
> > +
> > + return 0;
> > +}
> > +
> > +static void ext4_mb_seq_structs_stop(struct seq_file *seq, void *v)
> > +{
> > + struct super_block *sb = PDE_DATA(file_inode(seq->file));
> > +
> > + read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
> > +}
> > +
> > +const struct seq_operations ext4_mb_seq_structs_ops = {
> > + .start = ext4_mb_seq_structs_start,
> > + .next = ext4_mb_seq_structs_next,
> > + .stop = ext4_mb_seq_structs_stop,
> > + .show = ext4_mb_seq_structs_show,
> > +};
> > +
> > static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
> > {
> > int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
> > diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> > index 4e27fe6ed3ae..828d58b98310 100644
> > --- a/fs/ext4/sysfs.c
> > +++ b/fs/ext4/sysfs.c
> > @@ -527,6 +527,8 @@ int ext4_register_sysfs(struct super_block *sb)
> > ext4_fc_info_show, sb);
> > proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
> > &ext4_mb_seq_groups_ops, sb);
> > + proc_create_seq_data("mb_structs", 0444, sbi->s_proc,
> > + &ext4_mb_seq_structs_ops, sb);
> > }
> > return 0;
> > }
> > --
> > 2.30.0.365.g02bc693789-goog
> >
>
>
> Cheers, Andreas
>