On Sat, Jan 30, 2021 at 2:19 AM Andreas Dilger <[email protected]> wrote:
>
> On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > Instead of traversing through groups linearly,
>
> For linear IO writes to a file, before checking any of the lists,
> it makes sense to still check the current and next group to see
> if they have free extents of the proper size? That would preserve
> linearity of writes to a file in the common few-threads-writing case.
>
> That would reduce contention on the list locks (as you mention later)
> compared to *every* thread checking the list each time, as well as
> improve the sequential allocation for files. The cr0/cr1 list/tree
> would only be used for finding a new allocation group if the current
> and next groups fail.
Yes, it would make sense to do that. But I think that logic should
reside outside of this code. And perhaps be combined with
ext4_mb_find_by_goal()? Keeping that logic out of this would keep this
code simple.
>
> > scan groups in specific
> > orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> > largest free order >= the order of the request. So, with this patch,
> > we maintain all the ext4_group_info structs in lists for each possible
> > order.
>
> This would be more clear like (if I'm describing this correctly):
>
> we maintain lists for each possible order and insert each group
> into a list based on the largest free order in its buddy bitmap.
Yes, that's much better, I'll incorporate this in the commit message
in the next version.
>
> > During cr 0 allocation, we traverse these lists in the
> > increasing order of largest free orders. This allows us to find a
> > group with the best available cr 0 match in constant time. If nothing
> > can be found, we fallback to cr 1 immediately.
>
> Just to clarify, if allocating, say, an 8MB extent, this would jump
> straight to the 8MB group list to look for groups that have free
> extents in their 8MB buddy bitmap, and if that list is empty it will
> go to the 16MB list to look for a group, repeat as necessary until an
> extent is found (and split) or there are no larger extents and the
> cr0 pass is finished?
Yes that's right.
>
> > At CR1, the story is slightly different. We want to traverse in the
> > order of increasing average fragment size. For CR1, we maintain a rb
> > tree of groupinfos which is sorted by average fragment size. Instead
> > of traversing linearly, at CR1, we traverse in the order of increasing
> > average fragment size, starting at the most optimal group. This brings
> > down cr 1 search complexity to log(num groups).
> >
> > For cr >= 2, we just perform the linear search as before. Also, in
> > case of lock contention, we intermittently fallback to linear search
> > even in CR 0 and CR 1 cases. This allows us to proceed during the
> > allocation path even in case of high contention.
>
> Did you look at separate locks for each order of the cr0 list to
> reduce contention? It might be possible to opportunistically avoid
> locking any of the list heads when *checking* if the list is empty
> or not. I don't think there are serious issues with races if the
> list is already empty.
Ah, yes that's a good idea.

In general, I kept only one lock because these structures get accessed
together, but as you point out there's an opportunity to do better. As
you mentioned below, we can have one lock for the tree and perhaps
lock for each list.
>
> Having multiple list locks is fine if there is a specific locking
> order (larger to smaller) when multiple locks are taken. If there
> is a group with the correct order it only needs a single list lock,
> and if a higher order group is needed (requiring the extent to be
> split) then the next lower list would be locked to insert the group,
> which can be taken without dropping the higher-order list lock.
So, I think we can live without having to take multiple list locks.
Mainly because these structures (rb tree and lists) are more of a
"hint" and thus have some tolerance for inaccuracy. For example, in
the case that you mentioned above, let's say we have 2 threads and
both of them need the same higher order group. Also, let's assume that
both will end up splitting the extent. In that case, our function
ext4_mb_choose_next_group() will still make both the threads use the
same group. However, eventually only one of them will win (because
only one of them will be able to lock the groupinfo) and would update
our structures accordingly. After that the thread that lost, would be
able to find another right group by looking up our structures.

So, to summarize, I like the idea of having locks for each list since
that would reduce contention that we would otherwise see on one lock.
Also, these structures are not designed to be always in sync with
ext4_group_info i.e. it may happen that a group is found in the list
which isn't the right order for the group or a part of the rb tree is
not sorted, and that is okay. Such states would always be temporary
and would eventually get fixed. Allowing for this small inaccuracy,
allows us to not hold contentious locks for a long time and thereby it
increases the performance. It also simplifies the locking.
>
> When inserting groups back into the list, assuming that each thread
> will stick with the same group for some number of allocations (rather
> than having to re-lookup the same group each time), does it make more
> sense to put the group at the *end* of the list so that the next
> thread doing a lookup will find a different group than another thread
> was just allocating from?
That's a good idea.
>
> Is there any particular requirement why the rb lock and the list
> lock(s) need to be the same lock? I'd imagine that they do not,
> since a thread will be using either one list or the other. Making
> these separate locks and reducing the lock hold time would allow
> more threads to be doing useful work instead of spinning.
I agree, I'll do this in the next version.

Thanks,
Harshad
>
> Cheers, Andreas
>
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > ---
> > fs/ext4/ext4.h | 6 ++
> > fs/ext4/mballoc.c | 223 ++++++++++++++++++++++++++++++++++++++++++++--
> > fs/ext4/mballoc.h | 1 +
> > 3 files changed, 222 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 6dd127942208..da12a083bf52 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1527,6 +1527,9 @@ struct ext4_sb_info {
> > unsigned int s_mb_free_pending;
> > struct list_head s_freed_data_list; /* List of blocks to be freed
> > after commit completed */
> > + struct rb_root s_mb_avg_fragment_size_root;
> > + struct list_head *s_mb_largest_free_orders;
> > + rwlock_t s_mb_rb_lock;
> >
> > /* tunables */
> > unsigned long s_stripe;
> > @@ -3304,11 +3307,14 @@ struct ext4_group_info {
> > ext4_grpblk_t bb_free; /* total free blocks */
> > ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
> > ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
> > + ext4_group_t bb_group; /* Group number */
> > struct list_head bb_prealloc_list;
> > #ifdef DOUBLE_CHECK
> > void *bb_bitmap;
> > #endif
> > struct rw_semaphore alloc_sem;
> > + struct rb_node bb_avg_fragment_size_rb;
> > + struct list_head bb_largest_free_order_node;
> > ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
> > * regions, index is order.
> > * bb_counters[3] = 5 means
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 11c56b0e6f35..413259477b03 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -744,6 +744,193 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> > }
> > }
> >
> > +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> > + int (*cmp)(struct rb_node *, struct rb_node *))
> > +{
> > + struct rb_node **iter = &root->rb_node, *parent = NULL;
> > +
> > + while (*iter) {
> > + parent = *iter;
> > + if (cmp(new, *iter))
> > + iter = &((*iter)->rb_left);
> > + else
> > + iter = &((*iter)->rb_right);
> > + }
> > +
> > + rb_link_node(new, parent, iter);
> > + rb_insert_color(new, root);
> > +}
> > +
> > +static int
> > +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
> > +{
> > + struct ext4_group_info *grp1 = rb_entry(rb1,
> > + struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + struct ext4_group_info *grp2 = rb_entry(rb2,
> > + struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + int num_frags_1, num_frags_2;
> > +
> > + num_frags_1 = grp1->bb_fragments ?
> > + grp1->bb_free / grp1->bb_fragments : 0;
> > + num_frags_2 = grp2->bb_fragments ?
> > + grp2->bb_free / grp2->bb_fragments : 0;
> > +
> > + return (num_frags_1 < num_frags_2);
> > +}
> > +
> > +/*
> > + * Reinsert grpinfo into the avg_fragment_size tree and into the appropriate
> > + * largest_free_order list.
> > + */
> > +static void
> > +ext4_mb_reinsert_grpinfo(struct super_block *sb, struct ext4_group_info *grp)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +
> > + write_lock(&sbi->s_mb_rb_lock);
> > + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
> > + rb_erase(&grp->bb_avg_fragment_size_rb,
> > + &sbi->s_mb_avg_fragment_size_root);
> > + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
> > + }
> > +
> > + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
> > + &grp->bb_avg_fragment_size_rb,
> > + ext4_mb_avg_fragment_size_cmp);
> > +
> > + list_del_init(&grp->bb_largest_free_order_node);
> > + if (grp->bb_largest_free_order >= 0)
> > + list_add(&grp->bb_largest_free_order_node,
> > + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> > + write_unlock(&sbi->s_mb_rb_lock);
> > +}
> > +
> > +/*
> > + * ext4_mb_choose_next_group: choose next group for allocation.
> > + *
> > + * @ac Allocation Context
> > + * @new_cr This is an output parameter. If the there is no good group available
> > + * at current CR level, this field is updated to indicate the new cr
> > + * level that should be used.
> > + * @group This is an input / output parameter. As an input it indicates the last
> > + * group used for allocation. As output, this field indicates the
> > + * next group that should be used.
> > + * @ngroups Total number of groups
> > + */
> > +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> > + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + int avg_fragment_size, best_so_far, i;
> > + struct rb_node *node, *found;
> > + struct ext4_group_info *grp;
> > +
> > + *new_cr = ac->ac_criteria;
> > + if (*new_cr >= 2 ||
> > + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> > + goto inc_and_return;
> > +
> > + /*
> > + * If there is contention on the lock, instead of waiting for the lock
> > + * to become available, just continue searching lineraly.
> > + */
> > + if (!read_trylock(&sbi->s_mb_rb_lock))
> > + goto inc_and_return;
> > +
> > + if (*new_cr == 0) {
> > + grp = NULL;
> > +
> > + if (ac->ac_status == AC_STATUS_FOUND)
> > + goto inc_and_return;
> > +
> > + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> > + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> > + continue;
> > + grp = list_first_entry(&sbi->s_mb_largest_free_orders[i],
> > + struct ext4_group_info,
> > + bb_largest_free_order_node);
> > + break;
> > + }
> > +
> > + if (grp) {
> > + *group = grp->bb_group;
> > + goto done;
> > + }
> > + /* Increment cr and search again */
> > + *new_cr = 1;
> > + }
> > +
> > + /*
> > + * At CR 1, if enough groups are not loaded, we just fallback to
> > + * linear search
> > + */
> > + if (atomic_read(&sbi->s_mb_buddies_generated) <
> > + ext4_get_groups_count(ac->ac_sb)) {
> > + read_unlock(&sbi->s_mb_rb_lock);
> > + goto inc_and_return;
> > + }
> > +
> > + if (*new_cr == 1) {
> > + if (ac->ac_f_ex.fe_len > 0) {
> > + /* We have found something at CR 1 in the past */
> > + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
> > + found = rb_next(&grp->bb_avg_fragment_size_rb);
> > + if (found) {
> > + grp = rb_entry(found, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + *group = grp->bb_group;
> > + } else {
> > + *new_cr = 2;
> > + }
> > + goto done;
> > + }
> > +
> > + /* This is the first time we are searching in the tree */
> > + node = sbi->s_mb_avg_fragment_size_root.rb_node;
> > + best_so_far = 0;
> > + found = NULL;
> > +
> > + while (node) {
> > + grp = rb_entry(node, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + avg_fragment_size = grp->bb_fragments ?
> > + grp->bb_free / grp->bb_fragments : 0;
> > + if (avg_fragment_size > ac->ac_g_ex.fe_len) {
> > + if (!best_so_far || avg_fragment_size < best_so_far) {
> > + best_so_far = avg_fragment_size;
> > + found = node;
> > + }
> > + }
> > + if (avg_fragment_size > ac->ac_g_ex.fe_len)
> > + node = node->rb_right;
> > + else
> > + node = node->rb_left;
> > + }
> > + if (found) {
> > + grp = rb_entry(found, struct ext4_group_info,
> > + bb_avg_fragment_size_rb);
> > + *group = grp->bb_group;
> > + } else {
> > + *new_cr = 2;
> > + }
> > + }
> > +done:
> > + read_unlock(&sbi->s_mb_rb_lock);
> > + ac->ac_last_optimal_group = *group;
> > + return;
> > +
> > +inc_and_return:
> > + /*
> > + * Artificially restricted ngroups for non-extent
> > + * files makes group > ngroups possible on first loop.
> > + */
> > + *group = *group + 1;
> > + if (*group >= ngroups)
> > + *group = 0;
> > +}
> > +
> > /*
> > * Cache the order of the largest free extent we have available in this block
> > * group.
> > @@ -818,6 +1005,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
> > period = get_cycles() - period;
> > atomic_inc(&sbi->s_mb_buddies_generated);
> > atomic64_add(period, &sbi->s_mb_generation_time);
> > + ext4_mb_reinsert_grpinfo(sb, grp);
> > }
> >
> > /* The buddy information is attached the buddy cache inode
> > @@ -1517,6 +1705,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> >
> > done:
> > mb_set_largest_free_order(sb, e4b->bd_info);
> > + ext4_mb_reinsert_grpinfo(sb, e4b->bd_info);
> > mb_check_buddy(e4b);
> > }
> >
> > @@ -1653,6 +1842,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
> > }
> > mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
> >
> > + ext4_mb_reinsert_grpinfo(e4b->bd_sb, e4b->bd_info);
> > ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
> > mb_check_buddy(e4b);
> >
> > @@ -2345,17 +2535,20 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> > * from the goal value specified
> > */
> > group = ac->ac_g_ex.fe_group;
> > + ac->ac_last_optimal_group = group;
> > prefetch_grp = group;
> >
> > - for (i = 0; i < ngroups; group++, i++) {
> > - int ret = 0;
> > + for (i = 0; i < ngroups; i++) {
> > + int ret = 0, new_cr;
> > +
> > cond_resched();
> > - /*
> > - * Artificially restricted ngroups for non-extent
> > - * files makes group > ngroups possible on first loop.
> > - */
> > - if (group >= ngroups)
> > - group = 0;
> > +
> > + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
> > +
> > + if (new_cr != cr) {
> > + cr = new_cr;
> > + goto repeat;
> > + }
> >
> > /*
> > * Batch reads of the block allocation bitmaps
> > @@ -2650,7 +2843,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> > INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> > init_rwsem(&meta_group_info[i]->alloc_sem);
> > meta_group_info[i]->bb_free_root = RB_ROOT;
> > + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> > + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
> > meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
> > + meta_group_info[i]->bb_group = group;
> >
> > mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
> > return 0;
> > @@ -2840,6 +3036,15 @@ int ext4_mb_init(struct super_block *sb)
> > i++;
> > } while (i < MB_NUM_ORDERS(sb));
> >
> > + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
> > + sbi->s_mb_largest_free_orders =
> > + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> > + GFP_KERNEL);
> > + if (!sbi->s_mb_largest_free_orders)
> > + goto out;
> > + for (i = 0; i < MB_NUM_ORDERS(sb); i++)
> > + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> > + rwlock_init(&sbi->s_mb_rb_lock);
> >
> > spin_lock_init(&sbi->s_md_lock);
> > sbi->s_mb_free_pending = 0;
> > @@ -2903,6 +3108,7 @@ int ext4_mb_init(struct super_block *sb)
> > free_percpu(sbi->s_locality_groups);
> > sbi->s_locality_groups = NULL;
> > out:
> > + kfree(sbi->s_mb_largest_free_orders);
> > kfree(sbi->s_mb_offsets);
> > sbi->s_mb_offsets = NULL;
> > kfree(sbi->s_mb_maxs);
> > @@ -2959,6 +3165,7 @@ int ext4_mb_release(struct super_block *sb)
> > kvfree(group_info);
> > rcu_read_unlock();
> > }
> > + kfree(sbi->s_mb_largest_free_orders);
> > kfree(sbi->s_mb_offsets);
> > kfree(sbi->s_mb_maxs);
> > iput(sbi->s_buddy_cache);
> > diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> > index 68111a10cfee..57b44c7320b2 100644
> > --- a/fs/ext4/mballoc.h
> > +++ b/fs/ext4/mballoc.h
> > @@ -166,6 +166,7 @@ struct ext4_allocation_context {
> > /* copy of the best found extent taken before preallocation efforts */
> > struct ext4_free_extent ac_f_ex;
> >
> > + ext4_group_t ac_last_optimal_group;
> > __u16 ac_groups_scanned;
> > __u16 ac_found;
> > __u16 ac_tail;
> > --
> > 2.30.0.365.g02bc693789-goog
> >
>
>
> Cheers, Andreas
>
>
>
>
>

2021-02-01 23:39:49

by harshad shirwadkar

[permalink] [raw]

Subject: Re: [PATCH 4/4] ext4: add proc files to monitor new structures

I see. That makes sense. Also as you mentioned it would be good to
have a summary file that shows a summary of the structures. In fact,
looking at the contents of mb_groups and if we add a summary file, I
think we probably should just get rid of these verbose files.
mb_groups has information about largest free order and fragment size,
using which, we can reconstruct these data structures and we can then
verify the correctness using the summary file.

On Sat, Jan 30, 2021 at 2:58 AM Andreas Dilger <[email protected]> wrote:
>
> On Jan 29, 2021, at 3:29 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > This patch adds a new file "mb_structs" which allows us to see the
> > largest free order lists and the serialized average fragment tree.
>
> It might make sense to split these two structs into separate files.
> On large filesystems, there will be millions of groups, so having
> to dump both structs to find out information about one or the other
> would be a lot of overhead.
>
> Also, while the "mb_groups" file can be large (one line per group)
> for millions of groups, the files added here may have millions of
> groups on the same line. Text processing tools are used to dealing
> with many lines in a file, but are relatively poor at dealing with
> millions of characters on a single line...
>
> It would be good to have a "summary" file that dumps general info
> about these structs (e.g. the number of groups at each list order,
> depth of the rbtree). That could be maintained easily for the c0
> lists at least, since the list is locked whenever a group is added
> and removed.
>
> In Artem's patch to improve mballoc for large filesystems (which
> didn't land, but was useful anyway), there was an "mb_alloc_stats"
> file added:
>
> https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch#L102
>
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats:mballoc:
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: blocks: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: reqs: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: success: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: extents_scanned: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: goal_hits: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: 2^n_hits: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: breaks: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: lost: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c1_loops: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c2_loops: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: useless_c3_loops: 163
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c1_loops: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c2_loops: 1230
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: skipped_c3_loops: 0
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: buddies_generated: 6305
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: buddies_time_used: 20165523
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: preallocated: 9943702
> /proc/fs/ldiskfs/dm-2/mb_alloc_stats: discarded: 10506
>
> that would still be useful to maintain, since only a few of the lines
> are specific to his patch ({useless,skipped}_c[123]_loops)
I agree, this is very useful to have! it gives us a way to benchmark
efficiency of mballoc. I'll add mb_alloc_stats here.

Thanks,
Harshad
>
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > ---
> > fs/ext4/ext4.h | 1 +
> > fs/ext4/mballoc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
> > fs/ext4/sysfs.c | 2 ++
> > 3 files changed, 82 insertions(+)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index da12a083bf52..835e304e3113 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -2809,6 +2809,7 @@ int __init ext4_fc_init_dentry_cache(void);
> >
> > /* mballoc.c */
> > extern const struct seq_operations ext4_mb_seq_groups_ops;
> > +extern const struct seq_operations ext4_mb_seq_structs_ops;
> > extern long ext4_mb_stats;
> > extern long ext4_mb_max_to_scan;
> > extern int ext4_mb_init(struct super_block *);
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 413259477b03..ec4656903bd4 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -2741,6 +2741,85 @@ const struct seq_operations ext4_mb_seq_groups_ops = {
> > .show = ext4_mb_seq_groups_show,
> > };
> >
> > +static void *ext4_mb_seq_structs_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct super_block *sb = PDE_DATA(file_inode(seq->file));
> > + unsigned long position;
> > +
> > + read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
> > +
> > + if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + ext4_get_groups_count(sb))
> > + return NULL;
> > + position = *pos + 1;
> > + return (void *) ((unsigned long) position);
> > +}
> > +
> > +static void *ext4_mb_seq_structs_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > + struct super_block *sb = PDE_DATA(file_inode(seq->file));
> > + unsigned long position;
> > +
> > + ++*pos;
> > + if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + ext4_get_groups_count(sb))
> > + return NULL;
> > + position = *pos + 1;
> > + return (void *) ((unsigned long) position);
> > +}
> > +
> > +static int ext4_mb_seq_structs_show(struct seq_file *seq, void *v)
> > +{
> > + struct super_block *sb = PDE_DATA(file_inode(seq->file));
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > + unsigned long position = ((unsigned long) v);
> > + struct ext4_group_info *grpinfo;
> > + struct rb_node *n;
> > + int i;
> > +
> > + position--;
> > +
> > + if (position >= MB_NUM_ORDERS(sb)) {
> > + position -= MB_NUM_ORDERS(sb);
> > + if (position == 0)
> > + seq_puts(seq, "Group, Avg Fragment Size\n");
> > + n = rb_first(&sbi->s_mb_avg_fragment_size_root);
> > + for (i = 0; n && i < position; i++)
> > + n = rb_next(n);
> > + if (!n)
> > + return 0;
> > + grpinfo = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
> > + seq_printf(seq, "%d, %d\n",
> > + grpinfo->bb_group,
> > + grpinfo->bb_fragments ? grpinfo->bb_free / grpinfo->bb_fragments : 0);
> > + return 0;
> > + }
> > +
> > + if (position == 0)
> > + seq_puts(seq, "Largest Free Order Lists:\n");
> > +
> > + seq_printf(seq, "[%ld]: ", position);
> > + list_for_each_entry(grpinfo, &sbi->s_mb_largest_free_orders[position],
> > + bb_largest_free_order_node) {
> > + seq_printf(seq, "%d ", grpinfo->bb_group);
> > + }
> > + seq_puts(seq, "\n");
> > +
> > + return 0;
> > +}
> > +
> > +static void ext4_mb_seq_structs_stop(struct seq_file *seq, void *v)
> > +{
> > + struct super_block *sb = PDE_DATA(file_inode(seq->file));
> > +
> > + read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
> > +}
> > +
> > +const struct seq_operations ext4_mb_seq_structs_ops = {
> > + .start = ext4_mb_seq_structs_start,
> > + .next = ext4_mb_seq_structs_next,
> > + .stop = ext4_mb_seq_structs_stop,
> > + .show = ext4_mb_seq_structs_show,
> > +};
> > +
> > static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
> > {
> > int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
> > diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> > index 4e27fe6ed3ae..828d58b98310 100644
> > --- a/fs/ext4/sysfs.c
> > +++ b/fs/ext4/sysfs.c
> > @@ -527,6 +527,8 @@ int ext4_register_sysfs(struct super_block *sb)
> > ext4_fc_info_show, sb);
> > proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
> > &ext4_mb_seq_groups_ops, sb);
> > + proc_create_seq_data("mb_structs", 0444, sbi->s_proc,
> > + &ext4_mb_seq_structs_ops, sb);
> > }
> > return 0;
> > }
> > --
> > 2.30.0.365.g02bc693789-goog
> >
>
>
> Cheers, Andreas
>
>
>
>
>