2023-07-28 08:39:07

by Bobi Jam

Subject: [PATCH 1/2] ext4: optimize metadata allocation for hybrid LUNs

With LVM it is possible to create an LV with SSD storage at the
beginning of the LV and HDD storage at the end of the LV, and use that
to separate ext4 metadata allocations (that need small random IOs)
from data allocations (that are better suited for large sequential
IOs) depending on the type of underlying storage. Between 0.5-1.0% of
the filesystem capacity would need to be high-IOPS storage in order to
hold all of the internal metadata.

This would improve performance for inode and other metadata access,
such as ls, find, e2fsck, and in general improve file access latency,
modification, truncate, unlink, transaction commit, etc.

This patch splits the largest free order group lists and the average
fragment size lists into separate lists for IOPS/fast storage groups,
and performs cr 0 / cr 1 group scanning for metadata block allocation
in the following order:

cr 0 on largest free order IOPS group list
cr 1 on average fragment size IOPS group list
cr 0 on largest free order non-IOPS group list
cr 1 on average fragment size non-IOPS group list
cr >= 2 perform the linear search as before

Non-metadata block allocation does not allocate from the IOPS groups.

Add an option to mke2fs to mark which blocks are in the IOPS region
of the storage at format time:

-E iops=0-1024G,4096-8192G

so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
group descriptors to decide from which groups to allocate dynamic
filesystem metadata.

Signed-off-by: Bobi Jam <[email protected]>
---
fs/ext4/balloc.c | 2 +-
fs/ext4/ext4.h | 12 +++++
fs/ext4/mballoc.c | 154 ++++++++++++++++++++++++++++++++++++++++++------------
3 files changed, 134 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index c1edde8..7b1b3ec 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -739,7 +739,7 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
ar.inode = inode;
ar.goal = goal;
ar.len = count ? *count : 1;
- ar.flags = flags;
+ ar.flags = flags | EXT4_MB_HINT_METADATA;

ret = ext4_mb_new_blocks(handle, &ar, errp);
if (count)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8104a21..3444b6e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -382,6 +382,7 @@ struct flex_groups {
#define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* Block bitmap not in use */
#define EXT4_BG_INODE_ZEROED 0x0004 /* On-disk itable initialized to zero */
+#define EXT4_BG_IOPS 0x0010 /* In IOPS/fast storage */

/*
* Macro-instructions used to manage group descriptors
@@ -1112,6 +1113,8 @@ struct ext4_inode_info {
#define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
#define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */

+#define EXT2_FLAGS_HAS_IOPS 0x0080 /* has IOPS storage */
+
/*
* Mount flags set via mount options or defaults
*/
@@ -1514,8 +1517,12 @@ struct ext4_sb_info {
atomic_t s_retry_alloc_pending;
struct list_head *s_mb_avg_fragment_size;
rwlock_t *s_mb_avg_fragment_size_locks;
+ struct list_head *s_avg_fragment_size_list_iops; /* avg_fragment_size for IOPS groups */
+ rwlock_t *s_avg_fragment_size_locks_iops;
struct list_head *s_mb_largest_free_orders;
rwlock_t *s_mb_largest_free_orders_locks;
+ struct list_head *s_largest_free_orders_list_iops; /* largest_free_orders for IOPS grps */
+ rwlock_t *s_largest_free_orders_locks_iops;

/* tunables */
unsigned long s_stripe;
@@ -3366,6 +3373,7 @@ struct ext4_group_info {
#define EXT4_GROUP_INFO_IBITMAP_CORRUPT \
(1 << EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT)
#define EXT4_GROUP_INFO_BBITMAP_READ_BIT 4
+#define EXT4_GROUP_INFO_IOPS_BIT 5

#define EXT4_MB_GRP_NEED_INIT(grp) \
(test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
@@ -3382,6 +3390,10 @@ struct ext4_group_info {
(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
#define EXT4_MB_GRP_TEST_AND_SET_READ(grp) \
(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
+#define EXT4_MB_GRP_TEST_IOPS(grp) \
+ (test_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
+#define EXT4_MB_GRP_SET_IOPS(grp) \
+ (set_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))

#define EXT4_MAX_CONTENTION 8
#define EXT4_CONTENTION_THRESHOLD 2
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 20f67a2..6d218af 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -828,6 +828,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ rwlock_t *afs_locks;
+ struct list_head *afs_list;
int new_order;

if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
@@ -838,20 +840,23 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
if (new_order == grp->bb_avg_fragment_size_order)
return;

+ if (EXT4_MB_GRP_TEST_IOPS(grp)) {
+ afs_locks = sbi->s_avg_fragment_size_locks_iops;
+ afs_list = sbi->s_avg_fragment_size_list_iops;
+ } else {
+ afs_locks = sbi->s_mb_avg_fragment_size_locks;
+ afs_list = sbi->s_mb_avg_fragment_size;
+ }
+
if (grp->bb_avg_fragment_size_order != -1) {
- write_lock(&sbi->s_mb_avg_fragment_size_locks[
- grp->bb_avg_fragment_size_order]);
+ write_lock(&afs_locks[grp->bb_avg_fragment_size_order]);
list_del(&grp->bb_avg_fragment_size_node);
- write_unlock(&sbi->s_mb_avg_fragment_size_locks[
- grp->bb_avg_fragment_size_order]);
+ write_unlock(&afs_locks[grp->bb_avg_fragment_size_order]);
}
grp->bb_avg_fragment_size_order = new_order;
- write_lock(&sbi->s_mb_avg_fragment_size_locks[
- grp->bb_avg_fragment_size_order]);
- list_add_tail(&grp->bb_avg_fragment_size_node,
- &sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
- write_unlock(&sbi->s_mb_avg_fragment_size_locks[
- grp->bb_avg_fragment_size_order]);
+ write_lock(&afs_locks[new_order]);
+ list_add_tail(&grp->bb_avg_fragment_size_node, &afs_list[new_order]);
+ write_unlock(&afs_locks[new_order]);
}

/*
@@ -863,6 +868,10 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
{
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_group_info *iter, *grp;
+ bool iops = ac->ac_flags & EXT4_MB_HINT_METADATA &&
+ ac->ac_sb->s_flags & EXT2_FLAGS_HAS_IOPS;
+ rwlock_t *lfo_locks;
+ struct list_head *lfo_list;
int i;

if (ac->ac_status == AC_STATUS_FOUND)
@@ -871,17 +880,25 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
atomic_inc(&sbi->s_bal_cr0_bad_suggestions);

+ if (iops) {
+ lfo_locks = sbi->s_largest_free_orders_locks_iops;
+ lfo_list = sbi->s_largest_free_orders_list_iops;
+ } else {
+ lfo_locks = sbi->s_mb_largest_free_orders_locks;
+ lfo_list = sbi->s_mb_largest_free_orders;
+ }
+
grp = NULL;
for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
- if (list_empty(&sbi->s_mb_largest_free_orders[i]))
+ if (list_empty(&lfo_list[i]))
continue;
- read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
- if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
- read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+ read_lock(&lfo_locks[i]);
+ if (list_empty(&lfo_list[i])) {
+ read_unlock(&lfo_locks[i]);
continue;
}
grp = NULL;
- list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
+ list_for_each_entry(iter, &lfo_list[i],
bb_largest_free_order_node) {
if (sbi->s_mb_stats)
atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
@@ -890,7 +907,7 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
break;
}
}
- read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+ read_unlock(&lfo_locks[i]);
if (grp)
break;
}
@@ -913,6 +930,10 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
{
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_group_info *grp = NULL, *iter;
+ bool iops = ac->ac_flags & EXT4_MB_HINT_METADATA &&
+ ac->ac_sb->s_flags & EXT2_FLAGS_HAS_IOPS;
+ rwlock_t *afs_locks;
+ struct list_head *afs_list;
int i;

if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
@@ -920,16 +941,24 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
}

+ if (iops) {
+ afs_locks = sbi->s_avg_fragment_size_locks_iops;
+ afs_list = sbi->s_avg_fragment_size_list_iops;
+ } else {
+ afs_locks = sbi->s_mb_avg_fragment_size_locks;
+ afs_list = sbi->s_mb_avg_fragment_size;
+ }
+
for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
i < MB_NUM_ORDERS(ac->ac_sb); i++) {
- if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
+ if (list_empty(&afs_list[i]))
continue;
- read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
- if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
- read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+ read_lock(&afs_locks[i]);
+ if (list_empty(&afs_list[i])) {
+ read_unlock(&afs_locks[i]);
continue;
}
- list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
+ list_for_each_entry(iter, &afs_list[i],
bb_avg_fragment_size_node) {
if (sbi->s_mb_stats)
atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
@@ -938,7 +967,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
break;
}
}
- read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+ read_unlock(&afs_locks[i]);
if (grp)
break;
}
@@ -947,7 +976,15 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
*group = grp->bb_group;
ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
} else {
- *new_cr = 2;
+ if (iops) {
+ /* cannot find proper group in IOPS storage,
+ * fall back to cr0 for non-IOPS groups.
+ */
+ ac->ac_flags &= ~EXT4_MB_HINT_METADATA;
+ *new_cr = 0;
+ } else {
+ *new_cr = 2;
+ }
}
}

@@ -1030,6 +1067,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ rwlock_t *lfo_locks;
+ struct list_head *lfo_list;
int i;

for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
@@ -1042,21 +1081,24 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
return;
}

+ if (EXT4_MB_GRP_TEST_IOPS(grp)) {
+ lfo_locks = sbi->s_largest_free_orders_locks_iops;
+ lfo_list = sbi->s_largest_free_orders_list_iops;
+ } else {
+ lfo_locks = sbi->s_mb_largest_free_orders_locks;
+ lfo_list = sbi->s_mb_largest_free_orders;
+ }
+
if (grp->bb_largest_free_order >= 0) {
- write_lock(&sbi->s_mb_largest_free_orders_locks[
- grp->bb_largest_free_order]);
+ write_lock(&lfo_locks[grp->bb_largest_free_order]);
list_del_init(&grp->bb_largest_free_order_node);
- write_unlock(&sbi->s_mb_largest_free_orders_locks[
- grp->bb_largest_free_order]);
+ write_unlock(&lfo_locks[grp->bb_largest_free_order]);
}
grp->bb_largest_free_order = i;
if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
- write_lock(&sbi->s_mb_largest_free_orders_locks[
- grp->bb_largest_free_order]);
- list_add_tail(&grp->bb_largest_free_order_node,
- &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
- write_unlock(&sbi->s_mb_largest_free_orders_locks[
- grp->bb_largest_free_order]);
+ write_lock(&lfo_locks[i]);
+ list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
+ write_unlock(&lfo_locks[i]);
}
}

@@ -3150,6 +3192,8 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
init_rwsem(&meta_group_info[i]->alloc_sem);
meta_group_info[i]->bb_free_root = RB_ROOT;
+ if (desc->bg_flags & EXT4_BG_IOPS)
+ EXT4_MB_GRP_SET_IOPS(meta_group_info[i]);
INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
@@ -3423,6 +3467,24 @@ int ext4_mb_init(struct super_block *sb)
INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
}
+ sbi->s_avg_fragment_size_list_iops =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+ GFP_KERNEL);
+ if (!sbi->s_avg_fragment_size_list_iops) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ sbi->s_avg_fragment_size_locks_iops =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+ GFP_KERNEL);
+ if (!sbi->s_avg_fragment_size_locks_iops) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+ INIT_LIST_HEAD(&sbi->s_avg_fragment_size_list_iops[i]);
+ rwlock_init(&sbi->s_avg_fragment_size_locks_iops[i]);
+ }
sbi->s_mb_largest_free_orders =
kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
GFP_KERNEL);
@@ -3441,6 +3503,24 @@ int ext4_mb_init(struct super_block *sb)
INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
}
+ sbi->s_largest_free_orders_list_iops =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+ GFP_KERNEL);
+ if (!sbi->s_largest_free_orders_list_iops) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ sbi->s_largest_free_orders_locks_iops =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+ GFP_KERNEL);
+ if (!sbi->s_largest_free_orders_locks_iops) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+ INIT_LIST_HEAD(&sbi->s_largest_free_orders_list_iops[i]);
+ rwlock_init(&sbi->s_largest_free_orders_locks_iops[i]);
+ }

spin_lock_init(&sbi->s_md_lock);
sbi->s_mb_free_pending = 0;
@@ -3512,8 +3592,12 @@ int ext4_mb_init(struct super_block *sb)
out:
kfree(sbi->s_mb_avg_fragment_size);
kfree(sbi->s_mb_avg_fragment_size_locks);
+ kfree(sbi->s_avg_fragment_size_list_iops);
+ kfree(sbi->s_avg_fragment_size_locks_iops);
kfree(sbi->s_mb_largest_free_orders);
kfree(sbi->s_mb_largest_free_orders_locks);
+ kfree(sbi->s_largest_free_orders_list_iops);
+ kfree(sbi->s_largest_free_orders_locks_iops);
kfree(sbi->s_mb_offsets);
sbi->s_mb_offsets = NULL;
kfree(sbi->s_mb_maxs);
@@ -3582,8 +3666,12 @@ int ext4_mb_release(struct super_block *sb)
}
kfree(sbi->s_mb_avg_fragment_size);
kfree(sbi->s_mb_avg_fragment_size_locks);
+ kfree(sbi->s_avg_fragment_size_list_iops);
+ kfree(sbi->s_avg_fragment_size_locks_iops);
kfree(sbi->s_mb_largest_free_orders);
kfree(sbi->s_mb_largest_free_orders_locks);
+ kfree(sbi->s_largest_free_orders_list_iops);
+ kfree(sbi->s_largest_free_orders_locks_iops);
kfree(sbi->s_mb_offsets);
kfree(sbi->s_mb_maxs);
iput(sbi->s_buddy_cache);
--
1.8.3.1



2023-08-03 00:10:59

by Andreas Dilger

Subject: Re: [PATCH 1/2] ext4: optimize metadata allocation for hybrid LUNs

On Jul 27, 2023, at 5:45 PM, Bobi Jam <[email protected]> wrote:
>
> With LVM it is possible to create an LV with SSD storage at the
> beginning of the LV and HDD storage at the end of the LV, and use that
> to separate ext4 metadata allocations (that need small random IOs)
> from data allocations (that are better suited for large sequential
> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
> the filesystem capacity would need to be high-IOPS storage in order to
> hold all of the internal metadata.
>
> This would improve performance for inode and other metadata access,
> such as ls, find, e2fsck, and in general improve file access latency,
> modification, truncate, unlink, transaction commit, etc.
>
> This patch splits the largest free order group lists and the average
> fragment size lists into separate lists for IOPS/fast storage groups,
> and performs cr 0 / cr 1 group scanning for metadata block allocation
> in the following order:
>
> cr 0 on largest free order IOPS group list
> cr 1 on average fragment size IOPS group list
> cr 0 on largest free order non-IOPS group list
> cr 1 on average fragment size non-IOPS group list
> cr >= 2 perform the linear search as before
>
> Non-metadata block allocation does not allocate from the IOPS groups.
>
> Add an option to mke2fs to mark which blocks are in the IOPS region
> of the storage at format time:
>
> -E iops=0-1024G,4096-8192G
>
> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
> group descriptors to decide from which groups to allocate dynamic
> filesystem metadata.

Ted, Ritesh, Ojaswin,
I would appreciate your review and comments on these two patches.

They are a followup to the discussion about hybrid LVM devices a few
weeks ago in https://www.spinics.net/lists/linux-ext4/msg90237.html
"packed_meta_blocks=1 incompatible with resize2fs?".

The 2/2 patch adds an option to mke2fs to mark groups on the SSD/NVMe
storage with the "EXT4_BG_IOPS" flag. Together with mke2fs using the
existing sparse_super2, flex_bg, and packed_meta_blocks, this would
allocate all static metadata to the start of the device. The 1/2 patch
changes mballoc to keep two separate allocation lists/trees based on
the IOPS flag on each group.
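
As a concrete (illustrative only) example, such a hybrid LV could be
formatted with something like the following; the device name, flex
group size, and region size are made up here, and the iops= extended
option is the new one proposed in the 2/2 patch:

  mke2fs -t ext4 -O sparse_super2,flex_bg -G 256 \
         -E packed_meta_blocks=1,iops=0-1024G /dev/mapper/hybrid-lv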

This has the dual benefit that all filesystem metadata (typically 4KiB
fragmented allocation/read/write) is on fast storage, and these blocks
also do not interfere with (usually larger) data allocation/read/write
(both in terms of allocation fragmentation and contending IOPS/seeks).
This should significantly help both normal IO usage and e2fsck
(virtually all e2fsck IO would go to the SSD/NVMe storage).


The implementation is relatively simple as you can see. Currently it
has a flag in the superblock to indicate that IOPS groups are available
during block allocation, but even that is not strictly needed (it could
be detected at GDT reading time). Metadata allocations prefer to use
the IOPS groups (if available), otherwise fall back to regular groups.
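
Roughly, such detection could look like the sketch below (the
s_has_iops_groups field is invented for the example and is not part
of this patch):

/* Sketch only: scan the group descriptors once and remember whether
 * any group is flagged EXT4_BG_IOPS, instead of relying on an on-disk
 * EXT2_FLAGS_HAS_IOPS superblock flag.
 */
static void ext4_detect_iops_groups(struct super_block *sb)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        ext4_group_t group;

        for (group = 0; group < sbi->s_groups_count; group++) {
                struct ext4_group_desc *desc =
                        ext4_get_group_desc(sb, group, NULL);

                if (desc && (desc->bg_flags & cpu_to_le16(EXT4_BG_IOPS))) {
                        sbi->s_has_iops_groups = 1;     /* invented field */
                        break;
                }
        }
}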

For our usage, this is a "soft" feature that does not affect
compatibility. It would be mostly harmless if the filesystem were
mounted with an older kernel. At worst there would be some performance
loss that would disappear again over time, but I think this would
happen rarely.


This doesn't *directly* address the filesystem resize that Roberto was
asking about, but having IOPS groups used only for metadata would make
it easier to resize later (if only adding HDD capacity). Alternately,
because the individual groups are marked with the IOPS flag, adding a
second (third, fourth) flash region at the end of the current
filesystem to hold the new bitmaps and inode tables would be relatively
straightforward on top of this. There might be some work needed for
mke2fs to honor the "resize" option with packed_meta_blocks, but maybe
not much more.

We basically never resize filesystems, so this is not of any interest to
implement at this point.

Cheers, Andreas


2023-08-03 12:36:38

by Ritesh Harjani

Subject: Re: [PATCH 1/2] ext4: optimize metadata allocation for hybrid LUNs

Bobi Jam <[email protected]> writes:

> With LVM it is possible to create an LV with SSD storage at the
> beginning of the LV and HDD storage at the end of the LV, and use that
> to separate ext4 metadata allocations (that need small random IOs)
> from data allocations (that are better suited for large sequential
> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
> the filesystem capacity would need to be high-IOPS storage in order to
> hold all of the internal metadata.
>
> This would improve performance for inode and other metadata access,
> such as ls, find, e2fsck, and in general improve file access latency,
> modification, truncate, unlink, transaction commit, etc.
>
> This patch splits the largest free order group lists and the average
> fragment size lists into separate lists for IOPS/fast storage groups,
> and performs cr 0 / cr 1 group scanning for metadata block allocation
> in the following order:
>
> cr 0 on largest free order IOPS group list
> cr 1 on average fragment size IOPS group list
> cr 0 on largest free order non-IOPS group list
> cr 1 on average fragment size non-IOPS group list
> cr >= 2 perform the linear search as before

Yes. The implementation looks straightforward to me.


>
> Non-metadata block allocation does not allocate from the IOPS groups.
>
> Add an option to mke2fs to mark which blocks are in the IOPS region
> of the storage at format time:
>
> -E iops=0-1024G,4096-8192G

However few things to discuss here are -

1. What happens when the hdd space for data gets fully exhausted? AFAICS,
the allocation for data blocks will still succeed, however we won't be able
to make use of optimized scanning any more, because we search within the
iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.

2. Similarly, what happens when the ssd space for metadata gets full?
In this case we keep falling back to cr2 for allocation and we don't
utilize optimized scanning to find block groups in the hdd space to
allocate from.

3. So it seems that after a period of time these iops lists can have block
groups belonging to different ssds. Could this cause the metadata
allocation of related inodes to come from different ssds?
Will this be optimal? Checking on this...
...On checking further, we start with a goal group and we
at least scan s_mb_max_linear_groups (4) groups linearly. So it's unlikely
that we frequently allocate metadata blocks from different SSDs.

4. Ok, looking into this, do we even require the iops lists for metadata
allocations? Do we allocate more than 1 block for metadata? If not, then
maintaining these iops lists for metadata allocation isn't really
helpful. On the other hand, it does make sense to maintain them when we
allow data allocations from these ssds once the hdds get full.

5. Did we run any benchmarks with this yet? What kind of gains are we
looking for? Do we have any numbers for this?

6. I can't help but think of this...
Should there also be a provision for the user to pass hot/cold data
type hints which the filesystem can use to decide whether to allocate
from ssd v/s hdd? Does it even make sense to think in this direction?

-ritesh



2023-08-19 20:15:12

by Andreas Dilger

Subject: Re: [PATCH 1/2] ext4: optimize metadata allocation for hybrid LUNs

On Aug 16, 2023, at 4:09 AM, Ritesh Harjani (IBM) <[email protected]> wrote:
>
> Andreas Dilger <[email protected]> writes:
>
>> On Aug 3, 2023, at 6:10 AM, Ritesh Harjani (IBM) <[email protected]> wrote:
>>> 1. What happens when the hdd space for data gets fully exhausted? AFAICS,
>>> the allocation for data blocks will still succeed, however we won't be
>>> able to make use of optimized scanning any more. Because we search within
>>> iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.
>>
>> The intention for our usage is that data allocations should *only* come
>> from the HDD region of the device, and *not* from the IOPS (flash) region
>> of the device. The IOPS region will be comparatively small (0.5-1.0% of
>> the total device size) so using or not using this space will be mostly
>> meaningless to the overall filesystem usage, especially with a 1-5%
>> reserved blocks percentage that is the default for new filesystems.
>
> Yes, but when we give this functionality to non-enterprise users,
> everyone would like to take advantage of a faster performing ext4 using
> 1 ssd and a few hdds, or a smaller spare ssd and larger hdds. Then the
> space of the iops region might not strictly be less than 1-2% and could
> be anywhere between 10-50% ;)
>
> Shouldn't we still support this class of use case as well? ^^^
> So if the HDD gets full then the allocation should fall back to ssd for
> data blocks, no?

It's true that this is possible, and I've thought about optionally
allowing e.g. "small files" to be allocated in the IOPS space while
"large files" are allocated only in the HDD space. This involves
"policy" which is always tricky to get right. What is "small" and
what is "large"? At what threshold do we *stop* putting small files
into the IOPS groups when they get too full, or increase the size of
"small" files if it isn't filling up quickly enough vs. the non-IOPS
groups? ...

I'd prefer to get the basic infrastructure working, and then we can
have the long discussions about how the policy should work for the
*next* patches, because those decisions do not have a permanent effect
on the on-disk format. :-)

> Or we can have a policy knob i.e. fallback_data_to_iops_region_thresh.
> So if the free space %age in the iops region is above 1% (can be changed
> by the user) then data allocations can fall back to the iops region when
> they are unable to allocate data blocks from the hdd region.
>
> echo %age_threshold > fallback_data_to_iops_region_thresh (default 1%)
>
> Fall back data allocations to the iops region as long as the free space
> %age of the iops region stays above %age_threshold.
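
Concretely, the suggested check might look something like the sketch
below (all of the field and tunable names here are invented, they are
not in the patch):

static bool ext4_mb_data_may_use_iops(struct ext4_sb_info *sbi)
{
        /* invented counters: free/total clusters in EXT4_BG_IOPS groups */
        unsigned long long free = sbi->s_iops_free_clusters;
        unsigned long long total = sbi->s_iops_clusters_count;

        /* let data fall back to IOPS groups only while more than
         * fallback_data_to_iops_region_thresh percent is still free
         */
        return total && free * 100 > total * sbi->s_iops_data_thresh;
}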

IMHO, a simple "too full" threshold is sub-optimal, because it means
suddenly the IOPS groups would get filled up with regular file data,
and in the likely case that old files are deleted to free up space,
the IOPS groups will still be filled with the new files.

My preference would be to have a "base small file size" (e.g. 64KB)
and a "fullness ratio" (free IOPS blocks / free non-IOPS blocks) and
use the fullness ratio to scale the "small file size". In case the
free IOPS blocks are very small (e.g. my initial proposal of 1% of
the total volume size, most of which would be filled with static
metadata) then e.g. files < 64 KB * 0.5% ~= 0.32KB (probably *no* files,
since this is less than one block) would go into the IOPS groups.

If the ratio is more like 50% then files under 32KB could be allocated
into the IOPS groups, and if the non-IOPS groups end up filling faster
and the free space ratio is equal or even higher in the IOPS groups,
then 64KB or 128KB files can start to be allocated there.
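
In rough code, that scaling might look something like the following
standalone sketch (the names and the exact formula are only
illustrative):

static unsigned long ext4_iops_small_file_cutoff(unsigned long long free_iops,
                                                 unsigned long long free_noniops)
{
        /* "base small file size", e.g. 64KB */
        const unsigned long base = 64 * 1024;

        if (!free_noniops)
                return ULONG_MAX;       /* only IOPS space is left */

        /* ratio 0.5% -> ~320 bytes (no files), 50% -> 32KB, 200% -> 128KB */
        return (unsigned long)(base * free_iops / free_noniops);
}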

> I am interested in knowing what you think will be the challenges in
> supporting resize with hybrid devices? Like if someone would like to
> add an additional ssd and do a resize, do you think all later metadata
> allocations can be fulfilled from this iops region?
>
> And what happens when someone adds hdds to existing ssds?
> I guess adding an hdd followed by a resize operation can still allocate
> the GDT, block/inode bitmaps, inode tables, etc. for these block groups
> on the resized hdd, right?
>
> Are there any other challenges as well for such a use case?

At a high level, my expectation would be that resize would "work"
regardless of whether the IOPS groups have space or not, but how
optimal this is depends on how much free space is in the IOPS groups.
If the IOPS groups are over-provisioned, then it should be fine to
allocate bitmaps and inode table blocks in that space (with flex_bg).

It should also be possible to add more IOPS groups at the end of the
filesystem to help the resize to keep all metadata in the fast storage.
Allowing disjoint regions of flash storage is one of the reasons why
EXT4_BG_IOPS is a per-group flag and not just a "threshold" boundary
within the filesystem.


I only realized yesterday that online resize is completely disabled
for filesystems with sparse_super2. I think this is actually a mistake
because the problem looks like a bad interaction between sparse_super2
having only 2 backup groups, and the resize_inode feature expecting that
there are backup group descriptors in all of the usual places.

So I think it makes sense to change the "cannot do online resize" check
to only the case of sparse_super2 AND resize_inode being enabled. This
should be uncommon since sparse_super2 is mostly useful for filesystems
over 16TB in size, and resize_inode currently doesn't work in that case.

It does seem possible to fix resize_inode to work with sparse_super2 for
filesystems over 16TiB. Originally, the reason resize_inode is disallowed
for filesystems > 16TiB is the 2^32 block number limit for non-extent
files, and the fact that the increasing number of backup groups means a
large number of blocks need to be reserved. However, when using
sparse_super2 there are only 2 backup groups, and the fact that they can
be located within the first 16TiB (there is no reason that backup #2 has
to be in the last group) means resize_inode will have enough space in it
to reserve extra GDT blocks for the online resize to work smoothly,
whether the IOPS groups are in use or not. However, that is a separate
project...

Cheers, Andreas







2023-09-12 15:53:23

by Bobi Jam

Subject: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

With LVM it is possible to create an LV with SSD storage at the
beginning of the LV and HDD storage at the end of the LV, and use that
to separate ext4 metadata allocations (that need small random IOs)
from data allocations (that are better suited for large sequential
IOs) depending on the type of underlying storage. Between 0.5-1.0% of
the filesystem capacity would need to be high-IOPS storage in order to
hold all of the internal metadata.

This would improve performance for inode and other metadata access,
such as ls, find, e2fsck, and in general improve file access latency,
modification, truncate, unlink, transaction commit, etc.

This patch splits the largest free order group lists and the average
fragment size lists into separate lists for IOPS/fast storage groups,
and performs cr 0 / cr 1 group scanning for metadata block allocation
in the following order:

if (allocate metadata blocks)
    if (cr == 0)
        try to find a group in the largest free order IOPS group list
    if (cr == 1)
        try to find a group in the average fragment size IOPS group list
    if (the above two searches failed)
        fall through to the normal group lists as before
if (allocate data blocks)
    try to find a group in the normal group lists as before
    if (failed to find a group in the normal groups && mb_enable_iops_data)
        try to find a group in the IOPS groups

Non-metadata block allocation does not allocate from the IOPS groups
if non-IOPS groups are not used up.

Add for mke2fs an option to mark which blocks are in the IOPS region
of storage at format time:

-E iops=0-1024G,4096-8192G

so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
group descriptors to decide which groups to allocate dynamic
filesystem metadata.

Signed-off-by: Bobi Jam <[email protected]>

--
v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
from IOPS groups.
v1->v2: for metadata block allocation, search in IOPS list then normal
list.
---
fs/ext4/balloc.c | 2 +-
fs/ext4/ext4.h | 13 +++
fs/ext4/extents.c | 5 +-
fs/ext4/indirect.c | 5 +-
fs/ext4/mballoc.c | 229 +++++++++++++++++++++++++++++++++++++++++----
fs/ext4/sysfs.c | 4 +
6 files changed, 234 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index c1edde817be8..7b1b3ec2650c 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -739,7 +739,7 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
ar.inode = inode;
ar.goal = goal;
ar.len = count ? *count : 1;
- ar.flags = flags;
+ ar.flags = flags | EXT4_MB_HINT_METADATA;

ret = ext4_mb_new_blocks(handle, &ar, errp);
if (count)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8104a21b001a..a8f21f63f5ff 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -382,6 +382,7 @@ struct flex_groups {
#define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* Block bitmap not in use */
#define EXT4_BG_INODE_ZEROED 0x0004 /* On-disk itable initialized to zero */
+#define EXT4_BG_IOPS 0x0010 /* In IOPS/fast storage */

/*
* Macro-instructions used to manage group descriptors
@@ -1112,6 +1113,8 @@ struct ext4_inode_info {
#define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
#define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */

+#define EXT2_FLAGS_HAS_IOPS 0x0080 /* has IOPS storage */
+
/*
* Mount flags set via mount options or defaults
*/
@@ -1514,8 +1517,12 @@ struct ext4_sb_info {
atomic_t s_retry_alloc_pending;
struct list_head *s_mb_avg_fragment_size;
rwlock_t *s_mb_avg_fragment_size_locks;
+ struct list_head *s_avg_fragment_size_list_iops; /* avg_frament_size for IOPS groups */
+ rwlock_t *s_avg_fragment_size_locks_iops;
struct list_head *s_mb_largest_free_orders;
rwlock_t *s_mb_largest_free_orders_locks;
+ struct list_head *s_largest_free_orders_list_iops; /* largest_free_orders for IOPS grps */
+ rwlock_t *s_largest_free_orders_locks_iops;

/* tunables */
unsigned long s_stripe;
@@ -1532,6 +1539,7 @@ struct ext4_sb_info {
unsigned long s_mb_last_start;
unsigned int s_mb_prefetch;
unsigned int s_mb_prefetch_limit;
+ unsigned int s_mb_enable_iops_data;

/* stats for buddy allocator */
atomic_t s_bal_reqs; /* number of reqs with len > 1 */
@@ -3366,6 +3374,7 @@ struct ext4_group_info {
#define EXT4_GROUP_INFO_IBITMAP_CORRUPT \
(1 << EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT)
#define EXT4_GROUP_INFO_BBITMAP_READ_BIT 4
+#define EXT4_GROUP_INFO_IOPS_BIT 5

#define EXT4_MB_GRP_NEED_INIT(grp) \
(test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
@@ -3382,6 +3391,10 @@ struct ext4_group_info {
(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
#define EXT4_MB_GRP_TEST_AND_SET_READ(grp) \
(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
+#define EXT4_MB_GRP_TEST_IOPS(grp) \
+ (test_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
+#define EXT4_MB_GRP_SET_IOPS(grp) \
+ (set_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))

#define EXT4_MAX_CONTENTION 8
#define EXT4_CONTENTION_THRESHOLD 2
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 35703dce23a3..6bfa784a3dad 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4272,11 +4272,12 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ar.len = EXT4_NUM_B2C(sbi, offset+allocated);
ar.goal -= offset;
ar.logical -= offset;
- if (S_ISREG(inode->i_mode))
+ if (S_ISREG(inode->i_mode) &&
+ !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
ar.flags = EXT4_MB_HINT_DATA;
else
/* disable in-core preallocation for non-regular files */
- ar.flags = 0;
+ ar.flags = EXT4_MB_HINT_METADATA;
if (flags & EXT4_GET_BLOCKS_NO_NORMALIZE)
ar.flags |= EXT4_MB_HINT_NOPREALLOC;
if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index c68bebe7ff4b..e1042c4e8ce6 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -610,8 +610,11 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
memset(&ar, 0, sizeof(ar));
ar.inode = inode;
ar.logical = map->m_lblk;
- if (S_ISREG(inode->i_mode))
+ if (S_ISREG(inode->i_mode) &&
+ !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
ar.flags = EXT4_MB_HINT_DATA;
+ else
+ ar.flags = EXT4_MB_HINT_METADATA;
if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
ar.flags |= EXT4_MB_DELALLOC_RESERVED;
if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 20f67a260df5..a676d26eccbc 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -828,6 +828,8 @@ static void
mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ rwlock_t *afs_locks;
+ struct list_head *afs_list;
int new_order;

if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
@@ -838,20 +840,24 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
if (new_order == grp->bb_avg_fragment_size_order)
return;

+ if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
+ EXT4_MB_GRP_TEST_IOPS(grp)) {
+ afs_locks = sbi->s_avg_fragment_size_locks_iops;
+ afs_list = sbi->s_avg_fragment_size_list_iops;
+ } else {
+ afs_locks = sbi->s_mb_avg_fragment_size_locks;
+ afs_list = sbi->s_mb_avg_fragment_size;
+ }
+
if (grp->bb_avg_fragment_size_order != -1) {
- write_lock(&sbi->s_mb_avg_fragment_size_locks[
- grp->bb_avg_fragment_size_order]);
+ write_lock(&afs_locks[grp->bb_avg_fragment_size_order]);
list_del(&grp->bb_avg_fragment_size_node);
- write_unlock(&sbi->s_mb_avg_fragment_size_locks[
- grp->bb_avg_fragment_size_order]);
+ write_unlock(&afs_locks[grp->bb_avg_fragment_size_order]);
}
grp->bb_avg_fragment_size_order = new_order;
- write_lock(&sbi->s_mb_avg_fragment_size_locks[
- grp->bb_avg_fragment_size_order]);
- list_add_tail(&grp->bb_avg_fragment_size_node,
- &sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
- write_unlock(&sbi->s_mb_avg_fragment_size_locks[
- grp->bb_avg_fragment_size_order]);
+ write_lock(&afs_locks[new_order]);
+ list_add_tail(&grp->bb_avg_fragment_size_node, &afs_list[new_order]);
+ write_unlock(&afs_locks[new_order]);
}

/*
@@ -986,6 +992,95 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
return group + 1 >= ngroups ? 0 : group + 1;
}

+static bool ext4_mb_choose_next_iops_group_cr0(
+ struct ext4_allocation_context *ac, ext4_group_t *group)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_group_info *iter, *grp;
+ int i;
+
+ if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
+ atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
+
+ grp = NULL;
+ for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+ if (list_empty(&sbi->s_largest_free_orders_list_iops[i]))
+ continue;
+ read_lock(&sbi->s_largest_free_orders_locks_iops[i]);
+ if (list_empty(&sbi->s_largest_free_orders_list_iops[i])) {
+ read_unlock(&sbi->s_largest_free_orders_locks_iops[i]);
+ continue;
+ }
+ grp = NULL;
+ list_for_each_entry(iter,
+ &sbi->s_largest_free_orders_list_iops[i],
+ bb_largest_free_order_node) {
+ if (sbi->s_mb_stats)
+ atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
+ if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
+ grp = iter;
+ break;
+ }
+ }
+ read_unlock(&sbi->s_largest_free_orders_locks_iops[i]);
+ if (grp)
+ break;
+ }
+
+ if (grp) {
+ *group = grp->bb_group;
+ ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
+ return true;
+ }
+
+ return false;
+}
+
+static bool ext4_mb_choose_next_iops_group_cr1(
+ struct ext4_allocation_context *ac, ext4_group_t *group)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_group_info *grp = NULL, *iter;
+ int i;
+
+ if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
+ if (sbi->s_mb_stats)
+ atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
+ }
+
+ for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
+ i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+ if (list_empty(&sbi->s_avg_fragment_size_list_iops[i]))
+ continue;
+ read_lock(&sbi->s_avg_fragment_size_locks_iops[i]);
+ if (list_empty(&sbi->s_avg_fragment_size_list_iops[i])) {
+ read_unlock(&sbi->s_avg_fragment_size_locks_iops[i]);
+ continue;
+ }
+ list_for_each_entry(iter,
+ &sbi->s_avg_fragment_size_list_iops[i],
+ bb_avg_fragment_size_node) {
+ if (sbi->s_mb_stats)
+ atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
+ if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
+ grp = iter;
+ break;
+ }
+ }
+ read_unlock(&sbi->s_avg_fragment_size_locks_iops[i]);
+ if (grp)
+ break;
+ }
+
+ if (grp) {
+ *group = grp->bb_group;
+ ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
+ return true;
+ }
+
+ return false;
+}
+
/*
* ext4_mb_choose_next_group: choose next group for allocation.
*
@@ -1002,6 +1097,10 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ bool alloc_metadata = ac->ac_flags & EXT4_MB_HINT_METADATA;
+ bool ret = false;
+
*new_cr = ac->ac_criteria;

if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
@@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
return;
}

+ if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
+ if (*new_cr == 0)
+ ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
+ if (!ret && *new_cr < 2)
+ ret = ext4_mb_choose_next_iops_group_cr1(ac, group);
+ if (ret)
+ return;
+ /*
+ * Cannot get metadata group from IOPS storage, fall through
+ * to slow storage.
+ */
+ cond_resched();
+ }
+
if (*new_cr == 0) {
ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
} else if (*new_cr == 1) {
ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
} else {
+ /*
+ * Cannot get data group from slow storage, try IOPS storage
+ */
+ if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
+ !alloc_metadata && sbi->s_mb_enable_iops_data &&
+ *new_cr == 3) {
+ if (ac->ac_2order)
+ ret = ext4_mb_choose_next_iops_group_cr0(ac,
+ group);
+ if (!ret)
+ ext4_mb_choose_next_iops_group_cr1(ac, group);
+ }
/*
* TODO: For CR=2, we can arrange groups in an rb tree sorted by
* bb_free. But until that happens, we should never come here.
@@ -1030,6 +1155,8 @@ static void
mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ rwlock_t *lfo_locks;
+ struct list_head *lfo_list;
int i;

for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
@@ -1042,21 +1169,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
return;
}

+ if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
+ EXT4_MB_GRP_TEST_IOPS(grp)) {
+ lfo_locks = sbi->s_largest_free_orders_locks_iops;
+ lfo_list = sbi->s_largest_free_orders_list_iops;
+ } else {
+ lfo_locks = sbi->s_mb_largest_free_orders_locks;
+ lfo_list = sbi->s_mb_largest_free_orders;
+ }
+
if (grp->bb_largest_free_order >= 0) {
- write_lock(&sbi->s_mb_largest_free_orders_locks[
- grp->bb_largest_free_order]);
+ write_lock(&lfo_locks[grp->bb_largest_free_order]);
list_del_init(&grp->bb_largest_free_order_node);
- write_unlock(&sbi->s_mb_largest_free_orders_locks[
- grp->bb_largest_free_order]);
+ write_unlock(&lfo_locks[grp->bb_largest_free_order]);
}
grp->bb_largest_free_order = i;
if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
- write_lock(&sbi->s_mb_largest_free_orders_locks[
- grp->bb_largest_free_order]);
- list_add_tail(&grp->bb_largest_free_order_node,
- &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
- write_unlock(&sbi->s_mb_largest_free_orders_locks[
- grp->bb_largest_free_order]);
+ write_lock(&lfo_locks[i]);
+ list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
+ write_unlock(&lfo_locks[i]);
}
}

@@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
goto out;
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
goto out;
+ if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
+ (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
+ !sbi->s_mb_enable_iops_data)
+ goto out;
if (should_lock) {
__acquire(ext4_group_lock_ptr(sb, group));
ext4_unlock_group(sb, group);
@@ -3150,6 +3285,9 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
init_rwsem(&meta_group_info[i]->alloc_sem);
meta_group_info[i]->bb_free_root = RB_ROOT;
+ if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
+ desc->bg_flags & EXT4_BG_IOPS)
+ EXT4_MB_GRP_SET_IOPS(meta_group_info[i]);
INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
@@ -3423,6 +3561,26 @@ int ext4_mb_init(struct super_block *sb)
INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
}
+ if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
+ sbi->s_avg_fragment_size_list_iops =
+ kmalloc_array(MB_NUM_ORDERS(sb),
+ sizeof(struct list_head), GFP_KERNEL);
+ if (!sbi->s_avg_fragment_size_list_iops) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ sbi->s_avg_fragment_size_locks_iops =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+ GFP_KERNEL);
+ if (!sbi->s_avg_fragment_size_locks_iops) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+ INIT_LIST_HEAD(&sbi->s_avg_fragment_size_list_iops[i]);
+ rwlock_init(&sbi->s_avg_fragment_size_locks_iops[i]);
+ }
+ }
sbi->s_mb_largest_free_orders =
kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
GFP_KERNEL);
@@ -3441,6 +3599,27 @@ int ext4_mb_init(struct super_block *sb)
INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
}
+ if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
+ sbi->s_largest_free_orders_list_iops =
+ kmalloc_array(MB_NUM_ORDERS(sb),
+ sizeof(struct list_head), GFP_KERNEL);
+ if (!sbi->s_largest_free_orders_list_iops) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ sbi->s_largest_free_orders_locks_iops =
+ kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+ GFP_KERNEL);
+ if (!sbi->s_largest_free_orders_locks_iops) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+ INIT_LIST_HEAD(
+ &sbi->s_largest_free_orders_list_iops[i]);
+ rwlock_init(&sbi->s_largest_free_orders_locks_iops[i]);
+ }
+ }

spin_lock_init(&sbi->s_md_lock);
sbi->s_mb_free_pending = 0;
@@ -3481,6 +3660,8 @@ int ext4_mb_init(struct super_block *sb)
sbi->s_mb_group_prealloc, sbi->s_stripe);
}

+ sbi->s_mb_enable_iops_data = 0;
+
sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
if (sbi->s_locality_groups == NULL) {
ret = -ENOMEM;
@@ -3512,8 +3693,12 @@ int ext4_mb_init(struct super_block *sb)
out:
kfree(sbi->s_mb_avg_fragment_size);
kfree(sbi->s_mb_avg_fragment_size_locks);
+ kfree(sbi->s_avg_fragment_size_list_iops);
+ kfree(sbi->s_avg_fragment_size_locks_iops);
kfree(sbi->s_mb_largest_free_orders);
kfree(sbi->s_mb_largest_free_orders_locks);
+ kfree(sbi->s_largest_free_orders_list_iops);
+ kfree(sbi->s_largest_free_orders_locks_iops);
kfree(sbi->s_mb_offsets);
sbi->s_mb_offsets = NULL;
kfree(sbi->s_mb_maxs);
@@ -3582,8 +3767,12 @@ int ext4_mb_release(struct super_block *sb)
}
kfree(sbi->s_mb_avg_fragment_size);
kfree(sbi->s_mb_avg_fragment_size_locks);
+ kfree(sbi->s_avg_fragment_size_list_iops);
+ kfree(sbi->s_avg_fragment_size_locks_iops);
kfree(sbi->s_mb_largest_free_orders);
kfree(sbi->s_mb_largest_free_orders_locks);
+ kfree(sbi->s_largest_free_orders_list_iops);
+ kfree(sbi->s_largest_free_orders_locks_iops);
kfree(sbi->s_mb_offsets);
kfree(sbi->s_mb_maxs);
iput(sbi->s_buddy_cache);
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 3042bc605bbf..86ab6c4ed3b8 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -245,6 +245,7 @@ EXT4_ATTR(journal_task, 0444, journal_task);
EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
+EXT4_RW_ATTR_SBI_UI(mb_enable_iops_data, s_mb_enable_iops_data);

static unsigned int old_bump_val = 128;
EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -295,6 +296,7 @@ static struct attribute *ext4_attrs[] = {
ATTR_LIST(mb_prefetch),
ATTR_LIST(mb_prefetch_limit),
ATTR_LIST(last_trim_minblks),
+ ATTR_LIST(mb_enable_iops_data),
NULL,
};
ATTRIBUTE_GROUPS(ext4);
@@ -318,6 +320,7 @@ EXT4_ATTR_FEATURE(fast_commit);
#if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_FS_ENCRYPTION)
EXT4_ATTR_FEATURE(encrypted_casefold);
#endif
+EXT4_ATTR_FEATURE(iops);

static struct attribute *ext4_feat_attrs[] = {
ATTR_LIST(lazy_itable_init),
@@ -338,6 +341,7 @@ static struct attribute *ext4_feat_attrs[] = {
#if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_FS_ENCRYPTION)
ATTR_LIST(encrypted_casefold),
#endif
+ ATTR_LIST(iops),
NULL,
};
ATTRIBUTE_GROUPS(ext4_feat);
--
2.42.0

2023-09-18 21:48:13

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

On Sep 12, 2023, at 12:59 AM, Bobi Jam <[email protected]> wrote:
>
> With LVM it is possible to create an LV with SSD storage at the
> beginning of the LV and HDD storage at the end of the LV, and use that
> to separate ext4 metadata allocations (that need small random IOs)
> from data allocations (that are better suited for large sequential
> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
> the filesystem capacity would need to be high-IOPS storage in order to
> hold all of the internal metadata.
>
> This would improve performance for inode and other metadata access,
> such as ls, find, e2fsck, and in general improve file access latency,
> modification, truncate, unlink, transaction commit, etc.
>
> This patch split largest free order group lists and average fragment
> size lists into other two lists for IOPS/fast storage groups, and
> cr 0 / cr 1 group scanning for metadata block allocation in following
> order:
>
> if (allocate metadata blocks)
> if (cr == 0)
> try to find group in largest free order IOPS group list
> if (cr == 1)
> try to find group in fragment size IOPS group list
> if (above two find failed)
> fall through normal group lists as before
> if (allocate data blocks)
> try to find group in normal group lists as before
> if (failed to find group in normal group && mb_enable_iops_data)
> try to find group in IOPS groups
>
> Non-metadata block allocation does not allocate from the IOPS groups
> if non-IOPS groups are not used up.

Hi Ritesh,
I believe this updated version of the patch addresses your original
request that it is possible to allocate blocks from the IOPS block
groups if the non-IOPS groups are full. This is currently disabled
by default, because in cases where the IOPS groups make up only a
small fraction of the space (e.g. < 1% of total capacity) having data
blocks allocated from this space would not make a big improvement
to the end-user usage of the filesystem, but would semi-permanently
hurt the ability to allocate metadata into the IOPS groups.

We discussed various options on the ext4 concall to make this more
useful (e.g. allowing the root user to allocate from the IOPS groups
if the filesystem is out of space, having a heuristic to balance IOPS
vs. non-IOPS allocations for small files, having a BPF rule that can
specify which UID/GID/procname/filename/etc. can access this space),
but everyone was reluctant to put any complex policy into the kernel
to make this decision, since any such policy eventually turns out to be
wrong for some usage.

For now, there is just a simple on/off switch, and if this is enabled
the IOPS groups are only used when all of the non-IOPS groups are full.
Any more complex policy can be deferred to a separate patch, I think.

It has worked well so far in our testing.

Cheers, Andreas

> Add for mke2fs an option to mark which blocks are in the IOPS region
> of storage at format time (usually only one IOPS region is needed):
>
> -E iops=0-1024G,4096-8192G
>
> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
> group descriptors to decide which groups to allocate dynamic
> filesystem metadata.
>
> Signed-off-by: Bobi Jam <[email protected]>
>
> --
> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
> from IOPS groups.
> v1->v2: for metadata block allocation, search in IOPS list then normal
> list.
> ---
> fs/ext4/balloc.c | 2 +-
> fs/ext4/ext4.h | 13 +++
> fs/ext4/extents.c | 5 +-
> fs/ext4/indirect.c | 5 +-
> fs/ext4/mballoc.c | 229 +++++++++++++++++++++++++++++++++++++++++----
> fs/ext4/sysfs.c | 4 +
> 6 files changed, 234 insertions(+), 24 deletions(-)
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index c1edde817be8..7b1b3ec2650c 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -739,7 +739,7 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
> ar.inode = inode;
> ar.goal = goal;
> ar.len = count ? *count : 1;
> - ar.flags = flags;
> + ar.flags = flags | EXT4_MB_HINT_METADATA;
>
> ret = ext4_mb_new_blocks(handle, &ar, errp);
> if (count)
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8104a21b001a..a8f21f63f5ff 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -382,6 +382,7 @@ struct flex_groups {
> #define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
> #define EXT4_BG_BLOCK_UNINIT 0x0002 /* Block bitmap not in use */
> #define EXT4_BG_INODE_ZEROED 0x0004 /* On-disk itable initialized to zero */
> +#define EXT4_BG_IOPS 0x0010 /* In IOPS/fast storage */
>
> /*
> * Macro-instructions used to manage group descriptors
> @@ -1112,6 +1113,8 @@ struct ext4_inode_info {
> #define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
> #define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */
>
> +#define EXT2_FLAGS_HAS_IOPS 0x0080 /* has IOPS storage */
> +
> /*
> * Mount flags set via mount options or defaults
> */
> @@ -1514,8 +1517,12 @@ struct ext4_sb_info {
> atomic_t s_retry_alloc_pending;
> struct list_head *s_mb_avg_fragment_size;
> rwlock_t *s_mb_avg_fragment_size_locks;
> + struct list_head *s_avg_fragment_size_list_iops; /* avg_frament_size for IOPS groups */
> + rwlock_t *s_avg_fragment_size_locks_iops;
> struct list_head *s_mb_largest_free_orders;
> rwlock_t *s_mb_largest_free_orders_locks;
> + struct list_head *s_largest_free_orders_list_iops; /* largest_free_orders for IOPS grps */
> + rwlock_t *s_largest_free_orders_locks_iops;
>
> /* tunables */
> unsigned long s_stripe;
> @@ -1532,6 +1539,7 @@ struct ext4_sb_info {
> unsigned long s_mb_last_start;
> unsigned int s_mb_prefetch;
> unsigned int s_mb_prefetch_limit;
> + unsigned int s_mb_enable_iops_data;
>
> /* stats for buddy allocator */
> atomic_t s_bal_reqs; /* number of reqs with len > 1 */
> @@ -3366,6 +3374,7 @@ struct ext4_group_info {
> #define EXT4_GROUP_INFO_IBITMAP_CORRUPT \
> (1 << EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT)
> #define EXT4_GROUP_INFO_BBITMAP_READ_BIT 4
> +#define EXT4_GROUP_INFO_IOPS_BIT 5
>
> #define EXT4_MB_GRP_NEED_INIT(grp) \
> (test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
> @@ -3382,6 +3391,10 @@ struct ext4_group_info {
> (clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
> #define EXT4_MB_GRP_TEST_AND_SET_READ(grp) \
> (test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_TEST_IOPS(grp) \
> + (test_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_SET_IOPS(grp) \
> + (set_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
>
> #define EXT4_MAX_CONTENTION 8
> #define EXT4_CONTENTION_THRESHOLD 2
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 35703dce23a3..6bfa784a3dad 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4272,11 +4272,12 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> ar.len = EXT4_NUM_B2C(sbi, offset+allocated);
> ar.goal -= offset;
> ar.logical -= offset;
> - if (S_ISREG(inode->i_mode))
> + if (S_ISREG(inode->i_mode) &&
> + !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> ar.flags = EXT4_MB_HINT_DATA;
> else
> /* disable in-core preallocation for non-regular files */
> - ar.flags = 0;
> + ar.flags = EXT4_MB_HINT_METADATA;
> if (flags & EXT4_GET_BLOCKS_NO_NORMALIZE)
> ar.flags |= EXT4_MB_HINT_NOPREALLOC;
> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index c68bebe7ff4b..e1042c4e8ce6 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -610,8 +610,11 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
> memset(&ar, 0, sizeof(ar));
> ar.inode = inode;
> ar.logical = map->m_lblk;
> - if (S_ISREG(inode->i_mode))
> + if (S_ISREG(inode->i_mode) &&
> + !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> ar.flags = EXT4_MB_HINT_DATA;
> + else
> + ar.flags = EXT4_MB_HINT_METADATA;
> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> ar.flags |= EXT4_MB_DELALLOC_RESERVED;
> if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 20f67a260df5..a676d26eccbc 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -828,6 +828,8 @@ static void
> mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> {
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> + rwlock_t *afs_locks;
> + struct list_head *afs_list;
> int new_order;
>
> if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> @@ -838,20 +840,24 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> if (new_order == grp->bb_avg_fragment_size_order)
> return;
>
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + EXT4_MB_GRP_TEST_IOPS(grp)) {
> + afs_locks = sbi->s_avg_fragment_size_locks_iops;
> + afs_list = sbi->s_avg_fragment_size_list_iops;
> + } else {
> + afs_locks = sbi->s_mb_avg_fragment_size_locks;
> + afs_list = sbi->s_mb_avg_fragment_size;
> + }
> +
> if (grp->bb_avg_fragment_size_order != -1) {
> - write_lock(&sbi->s_mb_avg_fragment_size_locks[
> - grp->bb_avg_fragment_size_order]);
> + write_lock(&afs_locks[grp->bb_avg_fragment_size_order]);
> list_del(&grp->bb_avg_fragment_size_node);
> - write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> - grp->bb_avg_fragment_size_order]);
> + write_unlock(&afs_locks[grp->bb_avg_fragment_size_order]);
> }
> grp->bb_avg_fragment_size_order = new_order;
> - write_lock(&sbi->s_mb_avg_fragment_size_locks[
> - grp->bb_avg_fragment_size_order]);
> - list_add_tail(&grp->bb_avg_fragment_size_node,
> - &sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
> - write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> - grp->bb_avg_fragment_size_order]);
> + write_lock(&afs_locks[new_order]);
> + list_add_tail(&grp->bb_avg_fragment_size_node, &afs_list[new_order]);
> + write_unlock(&afs_locks[new_order]);
> }
>
> /*
> @@ -986,6 +992,95 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
> return group + 1 >= ngroups ? 0 : group + 1;
> }
>
> +static bool ext4_mb_choose_next_iops_group_cr0(
> + struct ext4_allocation_context *ac, ext4_group_t *group)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_group_info *iter, *grp;
> + int i;
> +
> + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> +
> + grp = NULL;
> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> + if (list_empty(&sbi->s_largest_free_orders_list_iops[i]))
> + continue;
> + read_lock(&sbi->s_largest_free_orders_locks_iops[i]);
> + if (list_empty(&sbi->s_largest_free_orders_list_iops[i])) {
> + read_unlock(&sbi->s_largest_free_orders_locks_iops[i]);
> + continue;
> + }
> + grp = NULL;
> + list_for_each_entry(iter,
> + &sbi->s_largest_free_orders_list_iops[i],
> + bb_largest_free_order_node) {
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> + grp = iter;
> + break;
> + }
> + }
> + read_unlock(&sbi->s_largest_free_orders_locks_iops[i]);
> + if (grp)
> + break;
> + }
> +
> + if (grp) {
> + *group = grp->bb_group;
> + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static bool ext4_mb_choose_next_iops_group_cr1(
> + struct ext4_allocation_context *ac, ext4_group_t *group)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_group_info *grp = NULL, *iter;
> + int i;
> +
> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> + if (sbi->s_mb_stats)
> + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> + }
> +
> + for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
> + i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> + if (list_empty(&sbi->s_avg_fragment_size_list_iops[i]))
> + continue;
> + read_lock(&sbi->s_avg_fragment_size_locks_iops[i]);
> + if (list_empty(&sbi->s_avg_fragment_size_list_iops[i])) {
> + read_unlock(&sbi->s_avg_fragment_size_locks_iops[i]);
> + continue;
> + }
> + list_for_each_entry(iter,
> + &sbi->s_avg_fragment_size_list_iops[i],
> + bb_avg_fragment_size_node) {
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
> + grp = iter;
> + break;
> + }
> + }
> + read_unlock(&sbi->s_avg_fragment_size_locks_iops[i]);
> + if (grp)
> + break;
> + }
> +
> + if (grp) {
> + *group = grp->bb_group;
> + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> + return true;
> + }
> +
> + return false;
> +}
> +
> /*
> * ext4_mb_choose_next_group: choose next group for allocation.
> *
> @@ -1002,6 +1097,10 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
> static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> {
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + bool alloc_metadata = ac->ac_flags & EXT4_MB_HINT_METADATA;
> + bool ret = false;
> +
> *new_cr = ac->ac_criteria;
>
> if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> return;
> }
>
> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
> + if (*new_cr == 0)
> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
> + if (!ret && *new_cr < 2)
> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group);
> + if (ret)
> + return;
> + /*
> + * Cannot get metadata group from IOPS storage, fall through
> + * to slow storage.
> + */
> + cond_resched();
> + }
> +
> if (*new_cr == 0) {
> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> } else if (*new_cr == 1) {
> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> } else {
> + /*
> + * Cannot get data group from slow storage, try IOPS storage
> + */
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + !alloc_metadata && sbi->s_mb_enable_iops_data &&
> + *new_cr == 3) {
> + if (ac->ac_2order)
> + ret = ext4_mb_choose_next_iops_group_cr0(ac,
> + group);
> + if (!ret)
> + ext4_mb_choose_next_iops_group_cr1(ac, group);
> + }
> /*
> * TODO: For CR=2, we can arrange groups in an rb tree sorted by
> * bb_free. But until that happens, we should never come here.
> @@ -1030,6 +1155,8 @@ static void
> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> {
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> + rwlock_t *lfo_locks;
> + struct list_head *lfo_list;
> int i;
>
> for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
> @@ -1042,21 +1169,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> return;
> }
>
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + EXT4_MB_GRP_TEST_IOPS(grp)) {
> + lfo_locks = sbi->s_largest_free_orders_locks_iops;
> + lfo_list = sbi->s_largest_free_orders_list_iops;
> + } else {
> + lfo_locks = sbi->s_mb_largest_free_orders_locks;
> + lfo_list = sbi->s_mb_largest_free_orders;
> + }
> +
> if (grp->bb_largest_free_order >= 0) {
> - write_lock(&sbi->s_mb_largest_free_orders_locks[
> - grp->bb_largest_free_order]);
> + write_lock(&lfo_locks[grp->bb_largest_free_order]);
> list_del_init(&grp->bb_largest_free_order_node);
> - write_unlock(&sbi->s_mb_largest_free_orders_locks[
> - grp->bb_largest_free_order]);
> + write_unlock(&lfo_locks[grp->bb_largest_free_order]);
> }
> grp->bb_largest_free_order = i;
> if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
> - write_lock(&sbi->s_mb_largest_free_orders_locks[
> - grp->bb_largest_free_order]);
> - list_add_tail(&grp->bb_largest_free_order_node,
> - &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> - write_unlock(&sbi->s_mb_largest_free_orders_locks[
> - grp->bb_largest_free_order]);
> + write_lock(&lfo_locks[i]);
> + list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
> + write_unlock(&lfo_locks[i]);
> }
> }
>
> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
> goto out;
> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
> goto out;
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
> + !sbi->s_mb_enable_iops_data)
> + goto out;
> if (should_lock) {
> __acquire(ext4_group_lock_ptr(sb, group));
> ext4_unlock_group(sb, group);
> @@ -3150,6 +3285,9 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> init_rwsem(&meta_group_info[i]->alloc_sem);
> meta_group_info[i]->bb_free_root = RB_ROOT;
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + desc->bg_flags & EXT4_BG_IOPS)
> + EXT4_MB_GRP_SET_IOPS(meta_group_info[i]);
> INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
> meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
> @@ -3423,6 +3561,26 @@ int ext4_mb_init(struct super_block *sb)
> INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
> rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
> }
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
> + sbi->s_avg_fragment_size_list_iops =
> + kmalloc_array(MB_NUM_ORDERS(sb),
> + sizeof(struct list_head), GFP_KERNEL);
> + if (!sbi->s_avg_fragment_size_list_iops) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + sbi->s_avg_fragment_size_locks_iops =
> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> + GFP_KERNEL);
> + if (!sbi->s_avg_fragment_size_locks_iops) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> + INIT_LIST_HEAD(&sbi->s_avg_fragment_size_list_iops[i]);
> + rwlock_init(&sbi->s_avg_fragment_size_locks_iops[i]);
> + }
> + }
> sbi->s_mb_largest_free_orders =
> kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> GFP_KERNEL);
> @@ -3441,6 +3599,27 @@ int ext4_mb_init(struct super_block *sb)
> INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
> }
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
> + sbi->s_largest_free_orders_list_iops =
> + kmalloc_array(MB_NUM_ORDERS(sb),
> + sizeof(struct list_head), GFP_KERNEL);
> + if (!sbi->s_largest_free_orders_list_iops) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + sbi->s_largest_free_orders_locks_iops =
> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> + GFP_KERNEL);
> + if (!sbi->s_largest_free_orders_locks_iops) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> + INIT_LIST_HEAD(
> + &sbi->s_largest_free_orders_list_iops[i]);
> + rwlock_init(&sbi->s_largest_free_orders_locks_iops[i]);
> + }
> + }
>
> spin_lock_init(&sbi->s_md_lock);
> sbi->s_mb_free_pending = 0;
> @@ -3481,6 +3660,8 @@ int ext4_mb_init(struct super_block *sb)
> sbi->s_mb_group_prealloc, sbi->s_stripe);
> }
>
> + sbi->s_mb_enable_iops_data = 0;
> +
> sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
> if (sbi->s_locality_groups == NULL) {
> ret = -ENOMEM;
> @@ -3512,8 +3693,12 @@ int ext4_mb_init(struct super_block *sb)
> out:
> kfree(sbi->s_mb_avg_fragment_size);
> kfree(sbi->s_mb_avg_fragment_size_locks);
> + kfree(sbi->s_avg_fragment_size_list_iops);
> + kfree(sbi->s_avg_fragment_size_locks_iops);
> kfree(sbi->s_mb_largest_free_orders);
> kfree(sbi->s_mb_largest_free_orders_locks);
> + kfree(sbi->s_largest_free_orders_list_iops);
> + kfree(sbi->s_largest_free_orders_locks_iops);
> kfree(sbi->s_mb_offsets);
> sbi->s_mb_offsets = NULL;
> kfree(sbi->s_mb_maxs);
> @@ -3582,8 +3767,12 @@ int ext4_mb_release(struct super_block *sb)
> }
> kfree(sbi->s_mb_avg_fragment_size);
> kfree(sbi->s_mb_avg_fragment_size_locks);
> + kfree(sbi->s_avg_fragment_size_list_iops);
> + kfree(sbi->s_avg_fragment_size_locks_iops);
> kfree(sbi->s_mb_largest_free_orders);
> kfree(sbi->s_mb_largest_free_orders_locks);
> + kfree(sbi->s_largest_free_orders_list_iops);
> + kfree(sbi->s_largest_free_orders_locks_iops);
> kfree(sbi->s_mb_offsets);
> kfree(sbi->s_mb_maxs);
> iput(sbi->s_buddy_cache);
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index 3042bc605bbf..86ab6c4ed3b8 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -245,6 +245,7 @@ EXT4_ATTR(journal_task, 0444, journal_task);
> EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
> EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
> EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
> +EXT4_RW_ATTR_SBI_UI(mb_enable_iops_data, s_mb_enable_iops_data);
>
> static unsigned int old_bump_val = 128;
> EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
> @@ -295,6 +296,7 @@ static struct attribute *ext4_attrs[] = {
> ATTR_LIST(mb_prefetch),
> ATTR_LIST(mb_prefetch_limit),
> ATTR_LIST(last_trim_minblks),
> + ATTR_LIST(mb_enable_iops_data),
> NULL,
> };
> ATTRIBUTE_GROUPS(ext4);
> @@ -318,6 +320,7 @@ EXT4_ATTR_FEATURE(fast_commit);
> #if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_FS_ENCRYPTION)
> EXT4_ATTR_FEATURE(encrypted_casefold);
> #endif
> +EXT4_ATTR_FEATURE(iops);
>
> static struct attribute *ext4_feat_attrs[] = {
> ATTR_LIST(lazy_itable_init),
> @@ -338,6 +341,7 @@ static struct attribute *ext4_feat_attrs[] = {
> #if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_FS_ENCRYPTION)
> ATTR_LIST(encrypted_casefold),
> #endif
> + ATTR_LIST(iops),
> NULL,
> };
> ATTRIBUTE_GROUPS(ext4_feat);
> --
> 2.42.0
>


Cheers, Andreas







2023-09-20 05:32:40

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

Andreas Dilger <[email protected]> writes:

> On Sep 12, 2023, at 12:59 AM, Bobi Jam <[email protected]> wrote:
>>
>> With LVM it is possible to create an LV with SSD storage at the
>> beginning of the LV and HDD storage at the end of the LV, and use that
>> to separate ext4 metadata allocations (that need small random IOs)
>> from data allocations (that are better suited for large sequential
>> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
>> the filesystem capacity would need to be high-IOPS storage in order to
>> hold all of the internal metadata.
>>
>> This would improve performance for inode and other metadata access,
>> such as ls, find, e2fsck, and in general improve file access latency,
>> modification, truncate, unlink, transaction commit, etc.
>>
>> This patch split largest free order group lists and average fragment
>> size lists into other two lists for IOPS/fast storage groups, and
>> cr 0 / cr 1 group scanning for metadata block allocation in following
>> order:
>>
>> if (allocate metadata blocks)
>> if (cr == 0)
>> try to find group in largest free order IOPS group list
>> if (cr == 1)
>> try to find group in fragment size IOPS group list
>> if (above two find failed)
>> fall through normal group lists as before
>> if (allocate data blocks)
>> try to find group in normal group lists as before
>> if (failed to find group in normal group && mb_enable_iops_data)
>> try to find group in IOPS groups
>>
>> Non-metadata block allocation does not allocate from the IOPS groups
>> if non-IOPS groups are not used up.
>
> Hi Ritesh,
> I believe this updated version of the patch addresses your original
> request that it is possible to allocate blocks from the IOPS block
> groups if the non-IOPS groups are full. This is currently disabled
> by default, because in cases where the IOPS groups make up only a
> small fraction of the space (e.g. < 1% of total capacity) having data
> blocks allocated from this space would not make a big improvement
> to the end-user usage of the filesystem, but would semi-permanently
> hurt the ability to allocate metadata into the IOPS groups.
>
> We discussed on the ext4 concall various options to make this more
> useful (e.g. allowing the root user to allocate from the IOPS groups
> if the filesystem is out of space, having a heuristic to balance IOPS
> vs. non-IOPS allocations for small files, having a BPF rule that can
> specify which UID/GID/procname/filename/etc. can access this space,
> but everyone was reluctant to put any complex policy into the kernel
> to make any decision, since this eventually is wrong for some usage.
>
> For now, there is just a simple on/off switch, and if this is enabled
> the IOPS groups are only used when all of the non-IOPS groups are full.
> Any more complex policy can be deferred to a separate patch, I think.

I think having an on/off switch for any user to enable/disable allocation
of data from iops groups is good enough for now. At least users with
larger iops disk space won't hit ENOSPC if they enable this feature.

So, thanks for addressing it. I am going through the series. I will provide
my review comments shortly.

Meanwhile, here is my understanding of your use case. Can you please
correct it or add your inputs to this -

1. You would like to create a FS with a combination of a high iops storage
disk and a non-high iops disk, with the high iops disk space being around
1% of the total disk capacity (well, this is obvious as it is stated in
the patch description itself).

2. Since, of course, ext4 currently does not support multiple drives, we
will do this using LVM and place the high iops disk in front.

3. Then at FS creation time we will use a command like this:
mkfs.ext4 -O sparse_super2 -E packed_meta_blocks,iops=0-1024G /path/to/lvm

Is this understanding right?

I have few followup queries as well -

1. What about thin-provisioned LVM? IIUC, the logical space there is
provisioned up front, but the physical allocation happens at write time,
so it might happen that both data and metadata allocations start to land
in iops/non-iops groups randomly?

2. Even in the case of traditional LVM, the mapping of the physical blocks
can be changed during an overwrite or discard sort of use case, right? So
do we have a guarantee of the metadata always sitting on high iops groups
as the filesystem ages?

3. With these mkfs options to utilize this feature, we do lose the
ability to resize, right? I am guessing resize will be disabled with
sparse_super2 and/or packed_meta_blocks itself?


Thanks!
-ritesh

2023-09-20 07:10:57

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

Bobi Jam <[email protected]> writes:

> With LVM it is possible to create an LV with SSD storage at the
> beginning of the LV and HDD storage at the end of the LV, and use that
> to separate ext4 metadata allocations (that need small random IOs)
> from data allocations (that are better suited for large sequential
> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
> the filesystem capacity would need to be high-IOPS storage in order to
> hold all of the internal metadata.
>
> This would improve performance for inode and other metadata access,
> such as ls, find, e2fsck, and in general improve file access latency,
> modification, truncate, unlink, transaction commit, etc.
>
> This patch split largest free order group lists and average fragment
> size lists into other two lists for IOPS/fast storage groups, and
> cr 0 / cr 1 group scanning for metadata block allocation in following
> order:
>
> if (allocate metadata blocks)
> if (cr == 0)
> try to find group in largest free order IOPS group list
> if (cr == 1)
> try to find group in fragment size IOPS group list
> if (above two find failed)
> fall through normal group lists as before

Ok, so we are agreeing that if the iops groups are full, we will fall
back to the non-iops groups for metadata.


> if (allocate data blocks)
> try to find group in normal group lists as before
> if (failed to find group in normal group && mb_enable_iops_data)
> try to find group in IOPS groups

same here but with mb_enable_iops_data.


>
> Non-metadata block allocation does not allocate from the IOPS groups
> if non-IOPS groups are not used up.

Sure. At least the ENOSPC use case can be handled once mb_enable_iops_data
is enabled (for users who might end up using a large iops disk).

>
> Add for mke2fs an option to mark which blocks are in the IOPS region
> of storage at format time:
>
> -E iops=0-1024G,4096-8192G
>
> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
> group descriptors to decide which groups to allocate dynamic
> filesystem metadata.
>
> Signed-off-by: Bobi Jam <[email protected]>
>
> --
> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
> from IOPS groups.
> v1->v2: for metadata block allocation, search in IOPS list then normal
> list.
> ---
> fs/ext4/balloc.c | 2 +-
> fs/ext4/ext4.h | 13 +++
> fs/ext4/extents.c | 5 +-
> fs/ext4/indirect.c | 5 +-
> fs/ext4/mballoc.c | 229 +++++++++++++++++++++++++++++++++++++++++----
> fs/ext4/sysfs.c | 4 +
> 6 files changed, 234 insertions(+), 24 deletions(-)
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index c1edde817be8..7b1b3ec2650c 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -739,7 +739,7 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
> ar.inode = inode;
> ar.goal = goal;
> ar.len = count ? *count : 1;
> - ar.flags = flags;
> + ar.flags = flags | EXT4_MB_HINT_METADATA;
>
> ret = ext4_mb_new_blocks(handle, &ar, errp);
> if (count)
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8104a21b001a..a8f21f63f5ff 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -382,6 +382,7 @@ struct flex_groups {
> #define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
> #define EXT4_BG_BLOCK_UNINIT 0x0002 /* Block bitmap not in use */
> #define EXT4_BG_INODE_ZEROED 0x0004 /* On-disk itable initialized to zero */
> +#define EXT4_BG_IOPS 0x0010 /* In IOPS/fast storage */

Not related to this patch. But why not 0x0008? Is it reserved for
anything else?


>
> /*
> * Macro-instructions used to manage group descriptors
> @@ -1112,6 +1113,8 @@ struct ext4_inode_info {
> #define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
> #define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */
>
> +#define EXT2_FLAGS_HAS_IOPS 0x0080 /* has IOPS storage */
> +

same here. Are the flag values in between 0x0004 and 0x0080 reserved?

> /*
> * Mount flags set via mount options or defaults
> */
> @@ -1514,8 +1517,12 @@ struct ext4_sb_info {
> atomic_t s_retry_alloc_pending;
> struct list_head *s_mb_avg_fragment_size;
> rwlock_t *s_mb_avg_fragment_size_locks;
> + struct list_head *s_avg_fragment_size_list_iops; /* avg_frament_size for IOPS groups */
> + rwlock_t *s_avg_fragment_size_locks_iops;
> struct list_head *s_mb_largest_free_orders;
> rwlock_t *s_mb_largest_free_orders_locks;
> + struct list_head *s_largest_free_orders_list_iops; /* largest_free_orders for IOPS grps */
> + rwlock_t *s_largest_free_orders_locks_iops;
>
> /* tunables */
> unsigned long s_stripe;
> @@ -1532,6 +1539,7 @@ struct ext4_sb_info {
> unsigned long s_mb_last_start;
> unsigned int s_mb_prefetch;
> unsigned int s_mb_prefetch_limit;
> + unsigned int s_mb_enable_iops_data;
>
> /* stats for buddy allocator */
> atomic_t s_bal_reqs; /* number of reqs with len > 1 */
> @@ -3366,6 +3374,7 @@ struct ext4_group_info {
> #define EXT4_GROUP_INFO_IBITMAP_CORRUPT \
> (1 << EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT)
> #define EXT4_GROUP_INFO_BBITMAP_READ_BIT 4
> +#define EXT4_GROUP_INFO_IOPS_BIT 5
>
> #define EXT4_MB_GRP_NEED_INIT(grp) \
> (test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
> @@ -3382,6 +3391,10 @@ struct ext4_group_info {
> (clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
> #define EXT4_MB_GRP_TEST_AND_SET_READ(grp) \
> (test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_TEST_IOPS(grp) \
> + (test_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_SET_IOPS(grp) \
> + (set_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
>
> #define EXT4_MAX_CONTENTION 8
> #define EXT4_CONTENTION_THRESHOLD 2
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 35703dce23a3..6bfa784a3dad 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4272,11 +4272,12 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> ar.len = EXT4_NUM_B2C(sbi, offset+allocated);
> ar.goal -= offset;
> ar.logical -= offset;
> - if (S_ISREG(inode->i_mode))
> + if (S_ISREG(inode->i_mode) &&
> + !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> ar.flags = EXT4_MB_HINT_DATA;
> else
> /* disable in-core preallocation for non-regular files */
> - ar.flags = 0;
> + ar.flags = EXT4_MB_HINT_METADATA;
> if (flags & EXT4_GET_BLOCKS_NO_NORMALIZE)
> ar.flags |= EXT4_MB_HINT_NOPREALLOC;
> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index c68bebe7ff4b..e1042c4e8ce6 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -610,8 +610,11 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
> memset(&ar, 0, sizeof(ar));
> ar.inode = inode;
> ar.logical = map->m_lblk;
> - if (S_ISREG(inode->i_mode))
> + if (S_ISREG(inode->i_mode) &&
> + !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> ar.flags = EXT4_MB_HINT_DATA;
> + else
> + ar.flags = EXT4_MB_HINT_METADATA;
> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> ar.flags |= EXT4_MB_DELALLOC_RESERVED;
> if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 20f67a260df5..a676d26eccbc 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -828,6 +828,8 @@ static void
> mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> {
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> + rwlock_t *afs_locks;
> + struct list_head *afs_list;
> int new_order;
>
> if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> @@ -838,20 +840,24 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> if (new_order == grp->bb_avg_fragment_size_order)
> return;
>
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + EXT4_MB_GRP_TEST_IOPS(grp)) {
> + afs_locks = sbi->s_avg_fragment_size_locks_iops;
> + afs_list = sbi->s_avg_fragment_size_list_iops;
> + } else {
> + afs_locks = sbi->s_mb_avg_fragment_size_locks;
> + afs_list = sbi->s_mb_avg_fragment_size;
> + }
> +
> if (grp->bb_avg_fragment_size_order != -1) {
> - write_lock(&sbi->s_mb_avg_fragment_size_locks[
> - grp->bb_avg_fragment_size_order]);
> + write_lock(&afs_locks[grp->bb_avg_fragment_size_order]);
> list_del(&grp->bb_avg_fragment_size_node);
> - write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> - grp->bb_avg_fragment_size_order]);
> + write_unlock(&afs_locks[grp->bb_avg_fragment_size_order]);
> }
> grp->bb_avg_fragment_size_order = new_order;
> - write_lock(&sbi->s_mb_avg_fragment_size_locks[
> - grp->bb_avg_fragment_size_order]);
> - list_add_tail(&grp->bb_avg_fragment_size_node,
> - &sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
> - write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> - grp->bb_avg_fragment_size_order]);
> + write_lock(&afs_locks[new_order]);
> + list_add_tail(&grp->bb_avg_fragment_size_node, &afs_list[new_order]);
> + write_unlock(&afs_locks[new_order]);
> }
>
> /*
> @@ -986,6 +992,95 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
> return group + 1 >= ngroups ? 0 : group + 1;
> }
>
> +static bool ext4_mb_choose_next_iops_group_cr0(
> + struct ext4_allocation_context *ac, ext4_group_t *group)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_group_info *iter, *grp;
> + int i;
> +
> + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> +
> + grp = NULL;
> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> + if (list_empty(&sbi->s_largest_free_orders_list_iops[i]))
> + continue;
> + read_lock(&sbi->s_largest_free_orders_locks_iops[i]);
> + if (list_empty(&sbi->s_largest_free_orders_list_iops[i])) {
> + read_unlock(&sbi->s_largest_free_orders_locks_iops[i]);
> + continue;
> + }
> + grp = NULL;
> + list_for_each_entry(iter,
> + &sbi->s_largest_free_orders_list_iops[i],
> + bb_largest_free_order_node) {
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> + grp = iter;
> + break;
> + }
> + }
> + read_unlock(&sbi->s_largest_free_orders_locks_iops[i]);
> + if (grp)
> + break;
> + }
> +
> + if (grp) {
> + *group = grp->bb_group;
> + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static bool ext4_mb_choose_next_iops_group_cr1(
> + struct ext4_allocation_context *ac, ext4_group_t *group)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_group_info *grp = NULL, *iter;
> + int i;
> +
> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> + if (sbi->s_mb_stats)
> + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> + }
> +
> + for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
> + i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> + if (list_empty(&sbi->s_avg_fragment_size_list_iops[i]))
> + continue;
> + read_lock(&sbi->s_avg_fragment_size_locks_iops[i]);
> + if (list_empty(&sbi->s_avg_fragment_size_list_iops[i])) {
> + read_unlock(&sbi->s_avg_fragment_size_locks_iops[i]);
> + continue;
> + }
> + list_for_each_entry(iter,
> + &sbi->s_avg_fragment_size_list_iops[i],
> + bb_avg_fragment_size_node) {
> + if (sbi->s_mb_stats)
> + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 1))) {
> + grp = iter;
> + break;
> + }
> + }
> + read_unlock(&sbi->s_avg_fragment_size_locks_iops[i]);
> + if (grp)
> + break;
> + }
> +
> + if (grp) {
> + *group = grp->bb_group;
> + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> + return true;
> + }
> +
> + return false;
> +}
> +
> /*
> * ext4_mb_choose_next_group: choose next group for allocation.
> *
> @@ -1002,6 +1097,10 @@ next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
> static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> {
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + bool alloc_metadata = ac->ac_flags & EXT4_MB_HINT_METADATA;
> + bool ret = false;
> +
> *new_cr = ac->ac_criteria;
>
> if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> return;
> }
>
> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
> + if (*new_cr == 0)
> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
> + if (!ret && *new_cr < 2)
> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group);

This is a bit confusing here. Say the *new_cr == 0 attempt fails; then we
return ret = false and fall back to choosing xx_iops_group_cr1(). And say
we were able to find a group which satisfies this allocation request, so
we return. But the caller never knows that we allocated using cr1 and not
cr0, because we never updated *new_cr inside xx_iops_group_crX().

> + if (ret)
> + return;
> + /*
> + * Cannot get metadata group from IOPS storage, fall through
> + * to slow storage.
> + */
> + cond_resched();

Not sure after you fix above case, do we still require cond_resched()
here. Note we already have one in the for loop which iterates over all
the groups for a given ac_criteria.

> + }
> +
> if (*new_cr == 0) {
> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> } else if (*new_cr == 1) {
> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> } else {
> + /*
> + * Cannot get data group from slow storage, try IOPS storage
> + */
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + !alloc_metadata && sbi->s_mb_enable_iops_data &&
> + *new_cr == 3) {
> + if (ac->ac_2order)
> + ret = ext4_mb_choose_next_iops_group_cr0(ac,
> + group);
> + if (!ret)
> + ext4_mb_choose_next_iops_group_cr1(ac, group);
> + }

We might never come here in this else case because
should_optimize_scan() which we check in the beginning of this function
> will return 0 and we will choose a next linear group for CR >= 2.

> /*
> * TODO: For CR=2, we can arrange groups in an rb tree sorted by
> * bb_free. But until that happens, we should never come here.
> @@ -1030,6 +1155,8 @@ static void
> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> {
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> + rwlock_t *lfo_locks;
> + struct list_head *lfo_list;
> int i;
>
> for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
> @@ -1042,21 +1169,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> return;
> }
>
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + EXT4_MB_GRP_TEST_IOPS(grp)) {
> + lfo_locks = sbi->s_largest_free_orders_locks_iops;
> + lfo_list = sbi->s_largest_free_orders_list_iops;
> + } else {
> + lfo_locks = sbi->s_mb_largest_free_orders_locks;
> + lfo_list = sbi->s_mb_largest_free_orders;
> + }
> +
> if (grp->bb_largest_free_order >= 0) {
> - write_lock(&sbi->s_mb_largest_free_orders_locks[
> - grp->bb_largest_free_order]);
> + write_lock(&lfo_locks[grp->bb_largest_free_order]);
> list_del_init(&grp->bb_largest_free_order_node);
> - write_unlock(&sbi->s_mb_largest_free_orders_locks[
> - grp->bb_largest_free_order]);
> + write_unlock(&lfo_locks[grp->bb_largest_free_order]);
> }
> grp->bb_largest_free_order = i;
> if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
> - write_lock(&sbi->s_mb_largest_free_orders_locks[
> - grp->bb_largest_free_order]);
> - list_add_tail(&grp->bb_largest_free_order_node,
> - &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> - write_unlock(&sbi->s_mb_largest_free_orders_locks[
> - grp->bb_largest_free_order]);
> + write_lock(&lfo_locks[i]);
> + list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
> + write_unlock(&lfo_locks[i]);
> }
> }
>
> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
> goto out;
> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
> goto out;
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
> + !sbi->s_mb_enable_iops_data)
> + goto out;

Since we already have the grp information here, checking for s_flags is
redundant, right?


> if (should_lock) {
> __acquire(ext4_group_lock_ptr(sb, group));
> ext4_unlock_group(sb, group);
> @@ -3150,6 +3285,9 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> init_rwsem(&meta_group_info[i]->alloc_sem);
> meta_group_info[i]->bb_free_root = RB_ROOT;
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> + desc->bg_flags & EXT4_BG_IOPS)
> + EXT4_MB_GRP_SET_IOPS(meta_group_info[i]);
> INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
> meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
> @@ -3423,6 +3561,26 @@ int ext4_mb_init(struct super_block *sb)
> INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
> rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
> }
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
> + sbi->s_avg_fragment_size_list_iops =
> + kmalloc_array(MB_NUM_ORDERS(sb),
> + sizeof(struct list_head), GFP_KERNEL);
> + if (!sbi->s_avg_fragment_size_list_iops) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + sbi->s_avg_fragment_size_locks_iops =
> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> + GFP_KERNEL);
> + if (!sbi->s_avg_fragment_size_locks_iops) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> + INIT_LIST_HEAD(&sbi->s_avg_fragment_size_list_iops[i]);
> + rwlock_init(&sbi->s_avg_fragment_size_locks_iops[i]);
> + }
> + }
> sbi->s_mb_largest_free_orders =
> kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> GFP_KERNEL);
> @@ -3441,6 +3599,27 @@ int ext4_mb_init(struct super_block *sb)
> INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
> }
> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
> + sbi->s_largest_free_orders_list_iops =
> + kmalloc_array(MB_NUM_ORDERS(sb),
> + sizeof(struct list_head), GFP_KERNEL);
> + if (!sbi->s_largest_free_orders_list_iops) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + sbi->s_largest_free_orders_locks_iops =
> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> + GFP_KERNEL);
> + if (!sbi->s_largest_free_orders_locks_iops) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> + INIT_LIST_HEAD(
> + &sbi->s_largest_free_orders_list_iops[i]);
> + rwlock_init(&sbi->s_largest_free_orders_locks_iops[i]);
> + }
> + }
>
> spin_lock_init(&sbi->s_md_lock);
> sbi->s_mb_free_pending = 0;
> @@ -3481,6 +3660,8 @@ int ext4_mb_init(struct super_block *sb)
> sbi->s_mb_group_prealloc, sbi->s_stripe);
> }
>
> + sbi->s_mb_enable_iops_data = 0;
> +
> sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
> if (sbi->s_locality_groups == NULL) {
> ret = -ENOMEM;
> @@ -3512,8 +3693,12 @@ int ext4_mb_init(struct super_block *sb)
> out:
> kfree(sbi->s_mb_avg_fragment_size);
> kfree(sbi->s_mb_avg_fragment_size_locks);
> + kfree(sbi->s_avg_fragment_size_list_iops);
> + kfree(sbi->s_avg_fragment_size_locks_iops);
> kfree(sbi->s_mb_largest_free_orders);
> kfree(sbi->s_mb_largest_free_orders_locks);
> + kfree(sbi->s_largest_free_orders_list_iops);
> + kfree(sbi->s_largest_free_orders_locks_iops);
> kfree(sbi->s_mb_offsets);
> sbi->s_mb_offsets = NULL;
> kfree(sbi->s_mb_maxs);
> @@ -3582,8 +3767,12 @@ int ext4_mb_release(struct super_block *sb)
> }
> kfree(sbi->s_mb_avg_fragment_size);
> kfree(sbi->s_mb_avg_fragment_size_locks);
> + kfree(sbi->s_avg_fragment_size_list_iops);
> + kfree(sbi->s_avg_fragment_size_locks_iops);
> kfree(sbi->s_mb_largest_free_orders);
> kfree(sbi->s_mb_largest_free_orders_locks);
> + kfree(sbi->s_largest_free_orders_list_iops);
> + kfree(sbi->s_largest_free_orders_locks_iops);
> kfree(sbi->s_mb_offsets);
> kfree(sbi->s_mb_maxs);
> iput(sbi->s_buddy_cache);
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index 3042bc605bbf..86ab6c4ed3b8 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -245,6 +245,7 @@ EXT4_ATTR(journal_task, 0444, journal_task);
> EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
> EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
> EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
> +EXT4_RW_ATTR_SBI_UI(mb_enable_iops_data, s_mb_enable_iops_data);
>
> static unsigned int old_bump_val = 128;
> EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
> @@ -295,6 +296,7 @@ static struct attribute *ext4_attrs[] = {
> ATTR_LIST(mb_prefetch),
> ATTR_LIST(mb_prefetch_limit),
> ATTR_LIST(last_trim_minblks),
> + ATTR_LIST(mb_enable_iops_data),
> NULL,
> };
> ATTRIBUTE_GROUPS(ext4);
> @@ -318,6 +320,7 @@ EXT4_ATTR_FEATURE(fast_commit);
> #if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_FS_ENCRYPTION)
> EXT4_ATTR_FEATURE(encrypted_casefold);
> #endif
> +EXT4_ATTR_FEATURE(iops);
>
> static struct attribute *ext4_feat_attrs[] = {
> ATTR_LIST(lazy_itable_init),
> @@ -338,6 +341,7 @@ static struct attribute *ext4_feat_attrs[] = {
> #if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_FS_ENCRYPTION)
> ATTR_LIST(encrypted_casefold),
> #endif
> + ATTR_LIST(iops),
> NULL,
> };
> ATTRIBUTE_GROUPS(ext4_feat);
> --
> 2.42.0

2023-09-22 05:06:58

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

On Sep 19, 2023, at 11:39 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
>
> Bobi Jam <[email protected]> writes:
>
>> With LVM it is possible to create an LV with SSD storage at the
>> beginning of the LV and HDD storage at the end of the LV, and use that
>> to separate ext4 metadata allocations (that need small random IOs)
>> from data allocations (that are better suited for large sequential
>> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
>> the filesystem capacity would need to be high-IOPS storage in order to
>> hold all of the internal metadata.
>>
>> This would improve performance for inode and other metadata access,
>> such as ls, find, e2fsck, and in general improve file access latency,
>> modification, truncate, unlink, transaction commit, etc.
>>
>> This patch split largest free order group lists and average fragment
>> size lists into other two lists for IOPS/fast storage groups, and
>> cr 0 / cr 1 group scanning for metadata block allocation in following
>> order:
>>
>> if (allocate metadata blocks)
>> if (cr == 0)
>> try to find group in largest free order IOPS group list
>> if (cr == 1)
>> try to find group in fragment size IOPS group list
>> if (above two find failed)
>> fall through normal group lists as before
>
> Ok, so we are agreeing that if the iops groups are full, we will
> fallback to non-iops group for metadata.
>
>
>> if (allocate data blocks)
>> try to find group in normal group lists as before
>> if (failed to find group in normal group && mb_enable_iops_data)
>> try to find group in IOPS groups
>
> same here but with mb_enable_iops_data.

Hi Ritesh,
thanks for your review.

Yes, this is in the case of allocating data blocks.

>> Non-metadata block allocation does not allocate from the IOPS groups
>> if non-IOPS groups are not used up.
>
> Sure. At least ENOSPC use case can be handled once mb_enable_iops_data
> is enabled. (for users who might end up using large iops disk)
>
>> Add for mke2fs an option to mark which blocks are in the IOPS region
>> of storage at format time:
>>
>> -E iops=0-1024G,4096-8192G
>>
>> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
>> group descriptors to decide which groups to allocate dynamic
>> filesystem metadata.
>>
>> Signed-off-by: Bobi Jam <[email protected]
>>
>> --
>> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
>> from IOPS groups.
>> v1->v2: for metadata block allocation, search in IOPS list then normal
>> list.
>> ---
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 8104a21b001a..a8f21f63f5ff 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -382,6 +382,7 @@ struct flex_groups {
>> #define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
>> #define EXT4_BG_BLOCK_UNINIT 0x0002 /* Block bitmap not in use */
>> #define EXT4_BG_INODE_ZEROED 0x0004 /* On-disk itable initialized to zero */
>> +#define EXT4_BG_IOPS 0x0010 /* In IOPS/fast storage */
>
> Not related to this patch. But why not 0x0008? Is it reserved for
> anything else?

That is being used by the patches to add persistent TRIM support:

+#define EXT4_BG_TRIMMED 0x0008 /* block group was trimmed */

https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/ ("[V2] e2fsprogs: support EXT2_FLAG_BG_TRIMMED and EXT2_FLAGS_TRACK_TRIM")
https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/ ("[V2] ext4: introduce EXT4_BG_TRIMMED to optimize fstrim")

>> /*
>> * Macro-instructions used to manage group descriptors
>> @@ -1112,6 +1113,8 @@ struct ext4_inode_info {
>> #define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
>> #define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */
>>
>> +#define EXT2_FLAGS_HAS_IOPS 0x0080 /* has IOPS storage */
>> +
>
> same here. Are the flag values in between 0x0004 and 0x0080 are reserved?

+#define EXT2_FLAGS_TRACK_TRIM 0x0008 /* Track trim status in bg */

The 0x10/20/40 flags are reserved in e2fsprogs but are not used by ext4.

>> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>> return;
>> }
>>
>> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
>> + if (*new_cr == 0)
>> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
>> + if (!ret && *new_cr < 2)
>> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group);

It looks like this patch is a bit stale since Ojaswin's renaming of the
cr0/cr1 phases to "p2_aligned" and "goal_fast" and "best_avail" names.

> This is a bit confusing here. Say if *new_cr = 0 fails, then we return
> ret = false and fallback to choosing xx_iops_group_cr1(). And say if we
> were able to find a group which satisfies this allocation request we
> return. But the caller never knows that we allocated using cr1 and not
> cr0. Because we never updated *new_cr inside xx_iops_group_crX()

I guess it is a bit messy, since updating new_cr might interfere with the
use of new_cr in the fallthrough to the non-IOPS groups below? In the
"1% IOPS groups" case, doing this extra scan for metadata blocks should
be very fast, since the metadata allocations are almost always one block
(excluding large xattrs), so the only time this would fail is if no IOPS
blocks are free, in which case it is fast since the group lists are empty.

We _could_ have a separate (in effect) cr0_3 and cr0_6 phases before the
non-IOPS group allocation starts to be able to distinguish these cases
(i.e. skip IOPS group scans if they are full) and the fallthrough search
is also having trouble to find a single free block for the metadata, but
I think that is pretty unlikely.

>
>> + if (ret)
>> + return;
>> + /*
>> + * Cannot get metadata group from IOPS storage, fall through
>> + * to slow storage.
>> + */
>> + cond_resched();
>
> Not sure after you fix above case, do we still require cond_resched()
> here. Note we already have one in the for loop which iterates over all
> the groups for a given ac_criteria.

The cond_resched() was here because it calls two "choose_next_groups()"
functions above without returning to the outer loop. However, you are
right that the group search is probably not the CPU heavy part here, so
this could probably be dropped?

>> + }
>> +
>> if (*new_cr == 0) {
>> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
>> } else if (*new_cr == 1) {
>> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
>> } else {
>> + /*
>> + * Cannot get data group from slow storage, try IOPS storage
>> + */
>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>> + !alloc_metadata && sbi->s_mb_enable_iops_data &&
>> + *new_cr == 3) {
>> + if (ac->ac_2order)
>> + ret = ext4_mb_choose_next_iops_group_cr0(ac,
>> + group);
>> + if (!ret)
>> + ext4_mb_choose_next_iops_group_cr1(ac, group);
>> + }
>
> We might never come here in this else case because
> should_optimize_scan() which we check in the beginning of this function
> will return 0 and we will choose a next linear group for CR >= 2.

Hmm, you are right. Just off the bottom of this hunk is a "WARN_ON(1)"
that this code block should never be entered.

Really, the fallback to IOPS groups for regular files should only happen
if *new_cr >= CR_GOAL_ANY_FREE. We don't want "normal" block
allocation to fill up the IOPS groups just because the filesystem is
fragmented and low on space, but only if out of non-IOPS space.
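
Roughly speaking, I would expect the data fallback hunk to end up looking
more like the sketch below (untested, reusing the names from the patch and
the CR_GOAL_ANY_FREE name used above, whatever it maps to after the rename):

	/*
	 * Data allocations only spill into the IOPS groups once every
	 * non-IOPS criteria has been exhausted.
	 */
	if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
	    !alloc_metadata && sbi->s_mb_enable_iops_data &&
	    *new_cr >= CR_GOAL_ANY_FREE) {
		if (ac->ac_2order)
			ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
		if (!ret)
			ext4_mb_choose_next_iops_group_cr1(ac, group);
	}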

>> /*
>> * TODO: For CR=2, we can arrange groups in an rb tree sorted by
>> * bb_free. But until that happens, we should never come here.
>> @@ -1030,6 +1155,8 @@ static void
>> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>> {
>> struct ext4_sb_info *sbi = EXT4_SB(sb);
>> + rwlock_t *lfo_locks;
>> + struct list_head *lfo_list;
>> int i;
>>
>> for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
>> @@ -1042,21 +1169,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>> return;
>> }
>>
>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>> + EXT4_MB_GRP_TEST_IOPS(grp)) {
>> + lfo_locks = sbi->s_largest_free_orders_locks_iops;
>> + lfo_list = sbi->s_largest_free_orders_list_iops;
>> + } else {
>> + lfo_locks = sbi->s_mb_largest_free_orders_locks;
>> + lfo_list = sbi->s_mb_largest_free_orders;
>> + }
>> +
>> if (grp->bb_largest_free_order >= 0) {
>> - write_lock(&sbi->s_mb_largest_free_orders_locks[
>> - grp->bb_largest_free_order]);
>> + write_lock(&lfo_locks[grp->bb_largest_free_order]);
>> list_del_init(&grp->bb_largest_free_order_node);
>> - write_unlock(&sbi->s_mb_largest_free_orders_locks[
>> - grp->bb_largest_free_order]);
>> + write_unlock(&lfo_locks[grp->bb_largest_free_order]);
>> }
>> grp->bb_largest_free_order = i;
>> if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
>> - write_lock(&sbi->s_mb_largest_free_orders_locks[
>> - grp->bb_largest_free_order]);
>> - list_add_tail(&grp->bb_largest_free_order_node,
>> - &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
>> - write_unlock(&sbi->s_mb_largest_free_orders_locks[
>> - grp->bb_largest_free_order]);
>> + write_lock(&lfo_locks[i]);
>> + list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
>> + write_unlock(&lfo_locks[i]);
>> }
>> }
>>
>> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
>> goto out;
>> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
>> goto out;
>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
>> + !sbi->s_mb_enable_iops_data)
>> + goto out;
>
> Since we already have the grp information here, checking for s_flags is
> redundant, right?

This is intended to stop regular data allocations in IOPS groups that are
found by next_linear_group().

With the change to allow regular data to be allocated in IOPS groups,
there might need to be an extra check added here to see what allocation
phase this is. Should we add *higher* CR_ phases above CR_ANY_FREE to
allow distinguishing between IOPS->regular fallback and regular->IOPS
fallback?


It seems like most of the complexity/issues here have crept in since the
addition of the fallback for regular data allocations in IOPS groups...
I'm not sure if we want to defer that functionality to a separate patch,
or if you have any suggestions on how to clarify this without adding a
lot of complexity?

Cheers, Andreas







2023-09-22 11:19:37

by Dongyang Li

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

On Thu, 2023-09-21 at 21:27 -0600, Andreas Dilger wrote:
> On Sep 19, 2023, at 11:23 PM, Ritesh Harjani (IBM)
> <[email protected]> wrote:
> >
> > Andreas Dilger <[email protected]> writes:
> >
> > > On Sep 12, 2023, at 12:59 AM, Bobi Jam <[email protected]>
> > > wrote:
> > > >
> > > > With LVM it is possible to create an LV with SSD storage at the
> > > > beginning of the LV and HDD storage at the end of the LV, and
> > > > use that
> > > > to separate ext4 metadata allocations (that need small random
> > > > IOs)
> > > > from data allocations (that are better suited for large
> > > > sequential
> > > > IOs) depending on the type of underlying storage.  Between 0.5-
> > > > 1.0% of
> > > > the filesystem capacity would need to be high-IOPS storage in
> > > > order to
> > > > hold all of the internal metadata.
> > > >
> > > > This would improve performance for inode and other metadata
> > > > access,
> > > > such as ls, find, e2fsck, and in general improve file access
> > > > latency,
> > > > modification, truncate, unlink, transaction commit, etc.
> > > >
> > > > This patch split largest free order group lists and average
> > > > fragment
> > > > size lists into other two lists for IOPS/fast storage groups,
> > > > and
> > > > cr 0 / cr 1 group scanning for metadata block allocation in
> > > > following
> > > > order:
> > > >
> > > > if (allocate metadata blocks)
> > > >     if (cr == 0)
> > > >             try to find group in largest free order IOPS group
> > > > list
> > > >     if (cr == 1)
> > > >             try to find group in fragment size IOPS group list
> > > >     if (above two find failed)
> > > >             fall through normal group lists as before
> > > > if (allocate data blocks)
> > > >     try to find group in normal group lists as before
> > > >     if (failed to find group in normal group &&
> > > > mb_enable_iops_data)
> > > >             try to find group in IOPS groups
> > > >
> > > > Non-metadata block allocation does not allocate from the IOPS
> > > > groups
> > > > if non-IOPS groups are not used up.
> > >
> > > Hi Ritesh,
> > > I believe this updated version of the patch addresses your
> > > original
> > > request that it is possible to allocate blocks from the IOPS
> > > block
> > > groups if the non-IOPS groups are full.  This is currently
> > > disabled
> > > by default, because in cases where the IOPS groups make up only a
> > > small fraction of the space (e.g. < 1% of total capacity) having
> > > data
> > > blocks allocated from this space would not make a big improvement
> > > to the end-user usage of the filesystem, but would semi-
> > > permanently
> > > hurt the ability to allocate metadata into the IOPS groups.
> > >
> > > We discussed on the ext4 concall various options to make this
> > > more
> > > useful (e.g. allowing the root user to allocate from the IOPS
> > > groups
> > > if the filesystem is out of space, having a heuristic to balance
> > > IOPS
> > > vs. non-IOPS allocations for small files, having a BPF rule that
> > > can
> > > specify which UID/GID/procname/filename/etc. can access this
> > > space,
> > > but everyone was reluctant to put any complex policy into the
> > > kernel
> > > to make any decision, since this eventually is wrong for some
> > > usage.
> > >
> > > For now, there is just a simple on/off switch, and if this is
> > > enabled
> > > the IOPS groups are only used when all of the non-IOPS groups are
> > > full.
> > > Any more complex policy can be deferred to a separate patch, I
> > > think.
> >
> > I think having an on/off switch for any user to enable/disable
> > allocation of data from iops groups is good enough for now. At least
> > users with larger iops disk space won't run out of ENOSPC if they
> > enable this feature.
> >
> > So, thanks for addressing it. I am going through the series. I will
> > provide
> > my review comments shortly.
> >
> > Meanwhile, here is my understanding of your use case. Can you please
> > correct/add your inputs to this -
> >
> > 1. You would like to create a FS with a combination of high iops
> > storage
> > disk and non-high iops disk. With high iops disk space to be around
> > 1 %
> > of the total disk capacity. (well this is obvious as it is stated
> > in the
> > patch description itself)
> >
>
> > 2. Since ofcourse ext4 currently does not support multi-drive, so
> > we
> > will use it using LVM and place high iops disk in front.
> >
>
> > 3. Then at the creation of the FS we will use a cmd like this
> >   mkfs.ext4 -O sparse_super2 -E packed_meta_blocks,iops=0-1024G
> > /path/to/lvm
> >
> > Is this understanding right?
>
> Correct.  Note that for filesystems larger than 256 TiB, when the
> group
> descriptor table grows larger than the size of group 0, a few extra
> patches that Dongyang developed are needed to fix the sparse_super2
> option in mke2fs to allow this to pack all metadata at the start of
> the
> device and move the GDT backup further out. For example, a 2TiB
> filesystem would use group 9 as the start of the first GDT backup.
>
> I don't expect this will be a problem for most users, and is somewhat
> an independent issue from the IOPS groups, so it has been kept
> separate.
>
> I couldn't find a version of that patch series pushed to the list,
> but it is in our Gerrit (the first one is already pushed):
>
> https://review.whamcloud.com/52219 ("e2fsck: check all sparse_super
> backups")
> https://review.whamcloud.com/52273 ("mke2fs: set free blocks
> accurately ...")
> https://review.whamcloud.com/52274 ("mke2fs: do not set BLOCK_UNINIT
> ...")
> https://review.whamcloud.com/51295 ("mke2fs: try to pack GDT blocks
> together")
>
> (Dongyang, could you please submit the last three patches in this
> series).
Will post the series when I finish making offline resize work with
the last patch. It needs more work than I expected, e.g. when growing
the filesystem, since we want the GDT blocks packed together, the GDT
could grow beyond group 0 into backup_bgs[0], which means backup_bgs[0]
needs to be moved.

Cheers
Dongyang
>
> > I have few followup queries as well -
> >
> > 1. What about Thin Provisioned LVM? IIUC, the space in that is
> > pre-allocated, but allocation happens at the time of write and it
> > might
> > so happen that both data/metadata allocations will start to sit in
> > iops/non-iops group randomly?
>
> I think the underlying storage type would be controlled by LVM in
> that
> case.  I don't know what kind of policy options are available with
> thin
> provisioned LVs, but my first thought is "don't do that with IOPS
> groups"
> since there is no way to know/control what the underlying storage is.
>
> > 2. Even in the case of traditional LVM, the mapping of the physical
> > blocks can be changed during an overwrite or discard sort of use case,
> > right? So do we have a guarantee of the metadata always sitting on
> > high iops groups after the filesystem ages?
>
> No, I don't think that would happen under normal usage.  The PV/LV
> maps
> are static after the LV is created, so overwriting a block at runtime
> with ext4 would give the same type of storage as at mke2fs time.
>
> The exception would be with LVM snapshots, in which case I'd suggest
> to
> use flash PV space for the snapshot (assuming there is enough) to
> avoid
> overhead when blocks are COW'd.  Even so, AFAIK the chunks written to
> the snapshot LV are the *old* blocks and the current blocks are kept
> on
> the main PV, so the IOPS groups would still work properly in this
> case.
>
> > 3. With these options of mkfs to utilize this feature, we do lose
> > the ability to resize, right? I am guessing resize will be disabled
> > with sparse_super2 and/or packed_meta_blocks itself?
>
> Online resize was disabled in commit v5.13-rc5-20-gb1489186cc83
> "ext4: add check to prevent attempting to resize an fs with
> sparse_super2".
> However, I think that was a misunderstanding.  It looks like online
> resize
> was getting confused by sparse_super2 together with resize_inode,
> because
> there are only 2 backup group descriptor tables, and resize_inode
> expects
> there to be a bunch more backups.  I suspect resize would "work" if
> resize_inode was disabled completely.
>
> The drawback is that online resize would almost immediately fall back
> to meta_bg (as it does anyway for > 16TiB filesystems anyway), and
> spew
> the GDT blocks and other metadata across the non-IOPS storage device.
> This would "work" (give you a larger filesystem), but is not ideal.
>
>
> I think the long-term solution for this would be to fix the
> interaction
> with sparse_super2, so that the resize_inode could reserve GDT blocks
> on the flash storage for the primary GDT and backup_bgs[0], and also
> backup_bgs[1] would be kept in a group < 2M so that it does not need
> to store 64-bit block numbers.  That would actually allow
> resize_inode
> to work with > 16TiB filesystems and continue to avoid using meta_bg.
>
> For the rest of the static metadata (bitmaps, inode tables) it would
> be
> possible to add more IOPS groups at the end of the current filesystem
> and add a "resize2fs -E iops=x-yG" option to have it allocate the
> static
> metadata from any of the IOPS groups.  That said, it has been a while
> since I looked at the online resize code in the kernel, so I'm not
> sure
> whether it is resize2fs or ext4 that is making these decisions
> anymore.
>
> Cheers, Andreas
>
>
>
>
>

2023-09-22 22:03:55

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

On Sep 19, 2023, at 11:23 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
>
> Andreas Dilger <[email protected]> writes:
>
>> On Sep 12, 2023, at 12:59 AM, Bobi Jam <[email protected]> wrote:
>>>
>>> With LVM it is possible to create an LV with SSD storage at the
>>> beginning of the LV and HDD storage at the end of the LV, and use that
>>> to separate ext4 metadata allocations (that need small random IOs)
>>> from data allocations (that are better suited for large sequential
>>> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
>>> the filesystem capacity would need to be high-IOPS storage in order to
>>> hold all of the internal metadata.
>>>
>>> This would improve performance for inode and other metadata access,
>>> such as ls, find, e2fsck, and in general improve file access latency,
>>> modification, truncate, unlink, transaction commit, etc.
>>>
>>> This patch split largest free order group lists and average fragment
>>> size lists into other two lists for IOPS/fast storage groups, and
>>> cr 0 / cr 1 group scanning for metadata block allocation in following
>>> order:
>>>
>>> if (allocate metadata blocks)
>>> if (cr == 0)
>>> try to find group in largest free order IOPS group list
>>> if (cr == 1)
>>> try to find group in fragment size IOPS group list
>>> if (above two find failed)
>>> fall through normal group lists as before
>>> if (allocate data blocks)
>>> try to find group in normal group lists as before
>>> if (failed to find group in normal group && mb_enable_iops_data)
>>> try to find group in IOPS groups
>>>
>>> Non-metadata block allocation does not allocate from the IOPS groups
>>> if non-IOPS groups are not used up.
>>
>> Hi Ritesh,
>> I believe this updated version of the patch addresses your original
>> request that it is possible to allocate blocks from the IOPS block
>> groups if the non-IOPS groups are full. This is currently disabled
>> by default, because in cases where the IOPS groups make up only a
>> small fraction of the space (e.g. < 1% of total capacity) having data
>> blocks allocated from this space would not make a big improvement
>> to the end-user usage of the filesystem, but would semi-permanently
>> hurt the ability to allocate metadata into the IOPS groups.
>>
>> We discussed on the ext4 concall various options to make this more
>> useful (e.g. allowing the root user to allocate from the IOPS groups
>> if the filesystem is out of space, having a heuristic to balance IOPS
>> vs. non-IOPS allocations for small files, having a BPF rule that can
>> specify which UID/GID/procname/filename/etc. can access this space,
>> but everyone was reluctant to put any complex policy into the kernel
>> to make any decision, since this eventually is wrong for some usage.
>>
>> For now, there is just a simple on/off switch, and if this is enabled
>> the IOPS groups are only used when all of the non-IOPS groups are full.
>> Any more complex policy can be deferred to a separate patch, I think.
>
> I think having an on/off switch for any user to enable/disable allocation
> of data from iops groups is good enough for now. At least users with
> larger iops disk space won't run out of ENOSPC if they enable this feature.
>
> So, thanks for addressing it. I am going through the series. I will provide
> my review comments shortly.
>
> Meanwhile, here is my understanding of your use case. Can you please
> correct/add your inputs to this -
>
> 1. You would like to create a FS with a combination of high iops storage
> disk and non-high iops disk. With high iops disk space to be around 1 %
> of the total disk capacity. (well this is obvious as it is stated in the
> patch description itself)
>

> 2. Since ofcourse ext4 currently does not support multi-drive, so we
> will use it using LVM and place high iops disk in front.
>

> 3. Then at the creation of the FS we will use a cmd like this
> mkfs.ext4 -O sparse_super2 -E packed_meta_blocks,iops=0-1024G /path/to/lvm
>
> Is this understanding right?

Correct. Note that for filesystems larger than 256 TiB, when the group
descriptor table grows larger than the size of group 0, a few extra
patches that Dongyang developed are needed to fix the sparse_super2
option in mke2fs to allow this to pack all metadata at the start of the
device and move the GDT backup further out. For example, a 2TiB
filesystem would use group 9 as the start of the first GDT backup.

I don't expect this will be a problem for most users, and is somewhat
an independent issue from the IOPS groups, so it has been kept separate.

I couldn't find a version of that patch series pushed to the list,
but it is in our Gerrit (the first one is already pushed):

https://review.whamcloud.com/52219 ("e2fsck: check all sparse_super backups")
https://review.whamcloud.com/52273 ("mke2fs: set free blocks accurately ...")
https://review.whamcloud.com/52274 ("mke2fs: do not set BLOCK_UNINIT ...")
https://review.whamcloud.com/51295 ("mke2fs: try to pack GDT blocks together")

(Dongyang, could you please submit the last three patches in this series).

> I have few followup queries as well -
>
> 1. What about Thin Provisioned LVM? IIUC, the space in that is
> pre-allocated, but allocation happens at the time of write and it might
> so happen that both data/metadata allocations will start to sit in
> iops/non-iops group randomly?

I think the underlying storage type would be controlled by LVM in that
case. I don't know what kind of policy options are available with thin
provisioned LVs, but my first thought is "don't do that with IOPS groups"
since there is no way to know/control what the underlying storage is.

> 2. Even in the case of traditional LVM, the mapping of the physical blocks
> can be changed during an overwrite or discard sort of use case, right? So
> do we have a guarantee of the metadata always sitting on high iops groups
> after the filesystem ages?

No, I don't think that would happen under normal usage. The PV/LV maps
are static after the LV is created, so overwriting a block at runtime
with ext4 would give the same type of storage as at mke2fs time.

The exception would be with LVM snapshots, in which case I'd suggest to
use flash PV space for the snapshot (assuming there is enough) to avoid
overhead when blocks are COW'd. Even so, AFAIK the chunks written to
the snapshot LV are the *old* blocks and the current blocks are kept on
the main PV, so the IOPS groups would still work properly in this case.

> 3. With these options of mkfs to utilize this feature, we do lose the
> ability to resize, right? I am guessing resize will be disabled with
> sparse_super2 and/or packed_meta_blocks itself?

Online resize was disabled in commit v5.13-rc5-20-gb1489186cc83
"ext4: add check to prevent attempting to resize an fs with sparse_super2".
However, I think that was a misunderstanding. It looks like online resize
was getting confused by sparse_super2 together with resize_inode, because
there are only 2 backup group descriptor tables, and resize_inode expects
there to be a bunch more backups. I suspect resize would "work" if
resize_inode was disabled completely.

The drawback is that online resize would almost immediately fall back
to meta_bg (as it does anyway for > 16TiB filesystems anyway), and spew
the GDT blocks and other metadata across the non-IOPS storage device.
This would "work" (give you a larger filesystem), but is not ideal.


I think the long-term solution for this would be to fix the interaction
with sparse_super2, so that the resize_inode could reserve GDT blocks
on the flash storage for the primary GDT and backup_bgs[0], and also
backup_bgs[1] would be kept in a group < 2M so that it does not need
to store 64-bit block numbers. That would actually allow resize_inode
to work with > 16TiB filesystems and continue to avoid using meta_bg.

For the rest of the static metadata (bitmaps, inode tables) it would be
possible to add more IOPS groups at the end of the current filesystem
and add a "resize2fs -E iops=x-yG" option to have it allocate the static
metadata from any of the IOPS groups. That said, it has been a while
since I looked at the online resize code in the kernel, so I'm not sure
whether it is resize2fs or ext4 that is making these decisions anymore.

Cheers, Andreas







2023-09-26 03:45:11

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

Andreas Dilger <[email protected]> writes:

> On Sep 19, 2023, at 11:39 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
>>
>> Bobi Jam <[email protected]> writes:
>>
>>> With LVM it is possible to create an LV with SSD storage at the
>>> beginning of the LV and HDD storage at the end of the LV, and use that
>>> to separate ext4 metadata allocations (that need small random IOs)
>>> from data allocations (that are better suited for large sequential
>>> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
>>> the filesystem capacity would need to be high-IOPS storage in order to
>>> hold all of the internal metadata.
>>>
>>> This would improve performance for inode and other metadata access,
>>> such as ls, find, e2fsck, and in general improve file access latency,
>>> modification, truncate, unlink, transaction commit, etc.
>>>
>>> This patch split largest free order group lists and average fragment
>>> size lists into other two lists for IOPS/fast storage groups, and
>>> cr 0 / cr 1 group scanning for metadata block allocation in following
>>> order:
>>>
>>> if (allocate metadata blocks)
>>> if (cr == 0)
>>> try to find group in largest free order IOPS group list
>>> if (cr == 1)
>>> try to find group in fragment size IOPS group list
>>> if (above two find failed)
>>> fall through normal group lists as before
>>
>> Ok, so we are agreeing that if the iops groups are full, we will
>> fallback to non-iops group for metadata.
>>
>>
>>> if (allocate data blocks)
>>> try to find group in normal group lists as before
>>> if (failed to find group in normal group && mb_enable_iops_data)
>>> try to find group in IOPS groups
>>
>> same here but with mb_enable_iops_data.
>
> Hi Ritesh,
> thanks for your review.
>
> Yes, this is in the case of allocating data blocks.
>
>>> Non-metadata block allocation does not allocate from the IOPS groups
>>> if non-IOPS groups are not used up.
>>
>> Sure. At least ENOSPC use case can be handled once mb_enable_iops_data
>> is enabled. (for users who might end up using large iops disk)
>>
>>> Add for mke2fs an option to mark which blocks are in the IOPS region
>>> of storage at format time:
>>>
>>> -E iops=0-1024G,4096-8192G
>>>
>>> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
>>> group descriptors to decide which groups to allocate dynamic
>>> filesystem metadata.
>>>
>>> Signed-off-by: Bobi Jam <[email protected]
>>>
>>> --
>>> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
>>> from IOPS groups.
>>> v1->v2: for metadata block allocation, search in IOPS list then normal
>>> list.
>>> ---
>>>
>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>> index 8104a21b001a..a8f21f63f5ff 100644
>>> --- a/fs/ext4/ext4.h
>>> +++ b/fs/ext4/ext4.h
>>> @@ -382,6 +382,7 @@ struct flex_groups {
>>> #define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
>>> #define EXT4_BG_BLOCK_UNINIT 0x0002 /* Block bitmap not in use */
>>> #define EXT4_BG_INODE_ZEROED 0x0004 /* On-disk itable initialized to zero */
>>> +#define EXT4_BG_IOPS 0x0010 /* In IOPS/fast storage */
>>
>> Not related to this patch. But why not 0x0008? Is it reserved for
>> anything else?
>
> That is being used by the patches to add persistent TRIM support:
>
> +#define EXT4_BG_TRIMMED 0x0008 /* block group was trimmed */
>
> https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/ ("[V2] e2fsprogs: support EXT2_FLAG_BG_TRIMMED and EXT2_FLAGS_TRACK_TRIM")
> https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/ ("[V2] ext4: introduce EXT4_BG_TRIMMED to optimize fstrim")
>
>>> /*
>>> * Macro-instructions used to manage group descriptors
>>> @@ -1112,6 +1113,8 @@ struct ext4_inode_info {
>>> #define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
>>> #define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */
>>>
>>> +#define EXT2_FLAGS_HAS_IOPS 0x0080 /* has IOPS storage */
>>> +
>>
>> same here. Are the flag values in between 0x0004 and 0x0080 are reserved?
>
> +#define EXT2_FLAGS_TRACK_TRIM 0x0008 /* Track trim status in bg */
>
> The 0x10/20/40 flags are reserved in e2fsprogs but are not used by ext4.
>

Thanks for the info on the flags part.

>>> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>>> return;
>>> }
>>>
>>> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
>>> + if (*new_cr == 0)
>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
>>> + if (!ret && *new_cr < 2)
>>> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group);
>
> It looks like this patch is a bit stale since Ojaswin's renaming of the
> cr0/cr1 phases to "p2_aligned" and "goal_fast" and "best_avail" names.
>

Yup. We should rebase our development effort to latest tree.

>> This is a bit confusing here. Say if *new_cr = 0 fails, then we return
>> ret = false and fallback to choosing xx_iops_group_cr1(). And say if we
>> were able to find a group which satisfies this allocation request we
>> return. But the caller never knows that we allocated using cr1 and not
>> cr0. Because we never updated *new_cr inside xx_iops_group_crX()
>
> I guess it is a bit messy, since updating new_cr might interfere with the
> use of new_cr in the fallthrough to the non-IOPS groups below? In the
> "1% IOPS groups" case, doing this extra scan for metadata blocks should
> be very fast, since the metadata allocations are almost always one block
> (excluding large xattrs), so the only time this would fail is if no IOPS
> blocks are free, in which case it is fast since the group lists are empty.
>

What I was suggesting is that we currently never update the *new_cr value
when we are able to find a suitable group, and that will confuse the two
for loops we have in the caller. We might as well just update the *new_cr
value once we have identified a suitable group in cr0 or cr1, before
returning.
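
Something along these lines is what I had in mind (rough sketch, not a
tested patch; the scan body is elided and passing new_cr down is just one
way of doing it):

static bool ext4_mb_choose_next_iops_group_cr1(
		struct ext4_allocation_context *ac, int *new_cr,
		ext4_group_t *group)
{
	struct ext4_group_info *grp = NULL;

	/* ... same avg-fragment-size scan over the IOPS lists as in the
	 * patch, filling in grp on success ... */

	if (grp) {
		*group = grp->bb_group;
		*new_cr = 1;	/* let the caller see which scan succeeded */
		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
		return true;
	}
	return false;
}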

> We _could_ have a separate (in effect) cr0_3 and cr0_6 phases before the
> non-IOPS group allocation starts to be able to distinguish these cases
> (i.e. skip IOPS group scans if they are full) and the fallthrough search
> is also having trouble to find a single free block for the metadata, but
> I think that is pretty unlikely.
>
>>
>>> + if (ret)
>>> + return;
>>> + /*
>>> + * Cannot get metadata group from IOPS storage, fall through
>>> + * to slow storage.
>>> + */
>>> + cond_resched();
>>
>> Not sure after you fix above case, do we still require cond_resched()
>> here. Note we already have one in the for loop which iterates over all
>> the groups for a given ac_criteria.
>
> The cond_resched() was here because it calls two "choose_next_groups()"
> functions above without returning to the outer loop. However, you are
> right that the group search is probably not the CPU heavy part here, so
> this could probably be dropped?
>

Sure, makes sense. Since for iops group scanning we look into both cr0 &
cr1 and then fall back to the non-iops groups, we may as well keep it.

>>> + }
>>> +
>>> if (*new_cr == 0) {
>>> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
>>> } else if (*new_cr == 1) {
>>> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
>>> } else {
>>> + /*
>>> + * Cannot get data group from slow storage, try IOPS storage
>>> + */
>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>>> + !alloc_metadata && sbi->s_mb_enable_iops_data &&
>>> + *new_cr == 3) {
>>> + if (ac->ac_2order)
>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac,
>>> + group);
>>> + if (!ret)
>>> + ext4_mb_choose_next_iops_group_cr1(ac, group);
>>> + }
>>
>> We might never come here in this else case because
>> should_optimize_scan() which we check in the beginning of this function
>> will return 0 and we will choose a next linear group for CR >= 2.
>
> Hmm, you are right. Just off the bottom of this hunk is a "WARN_ON(1)"
> that this code block should never be entered.

right.

>
> Really, the fallback to IOPS groups for regular files should only happen
> in case of if *new_cr >= CR_GOAL_ANY_FREE. We don't want "normal" block
> allocation to fill up the IOPS groups just because the filesystem is
> fragmented and low on space, but only if out of non-IOPS space.
>

Sure, I have added some comments later on this policy part...

>>> /*
>>> * TODO: For CR=2, we can arrange groups in an rb tree sorted by
>>> * bb_free. But until that happens, we should never come here.
>>> @@ -1030,6 +1155,8 @@ static void
>>> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>>> {
>>> struct ext4_sb_info *sbi = EXT4_SB(sb);
>>> + rwlock_t *lfo_locks;
>>> + struct list_head *lfo_list;
>>> int i;
>>>
>>> for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
>>> @@ -1042,21 +1169,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>>> return;
>>> }
>>>
>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>>> + EXT4_MB_GRP_TEST_IOPS(grp)) {
>>> + lfo_locks = sbi->s_largest_free_orders_locks_iops;
>>> + lfo_list = sbi->s_largest_free_orders_list_iops;
>>> + } else {
>>> + lfo_locks = sbi->s_mb_largest_free_orders_locks;
>>> + lfo_list = sbi->s_mb_largest_free_orders;
>>> + }
>>> +
>>> if (grp->bb_largest_free_order >= 0) {
>>> - write_lock(&sbi->s_mb_largest_free_orders_locks[
>>> - grp->bb_largest_free_order]);
>>> + write_lock(&lfo_locks[grp->bb_largest_free_order]);
>>> list_del_init(&grp->bb_largest_free_order_node);
>>> - write_unlock(&sbi->s_mb_largest_free_orders_locks[
>>> - grp->bb_largest_free_order]);
>>> + write_unlock(&lfo_locks[grp->bb_largest_free_order]);
>>> }
>>> grp->bb_largest_free_order = i;
>>> if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
>>> - write_lock(&sbi->s_mb_largest_free_orders_locks[
>>> - grp->bb_largest_free_order]);
>>> - list_add_tail(&grp->bb_largest_free_order_node,
>>> - &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
>>> - write_unlock(&sbi->s_mb_largest_free_orders_locks[
>>> - grp->bb_largest_free_order]);
>>> + write_lock(&lfo_locks[i]);
>>> + list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
>>> + write_unlock(&lfo_locks[i]);
>>> }
>>> }
>>>
>>> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
>>> goto out;
>>> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
>>> goto out;
>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>>> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
>>> + !sbi->s_mb_enable_iops_data)
>>> + goto out;
>>
>> Since we already have the grp information here, checking for s_flags is
>> redundant, right?
>
> This is intended to stop regular data allocations in IOPS groups that are
> found by next_linear_group().

What I meant is that EXT4_MB_GRP_TEST_IOPS(grp) will only be true when we
have sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS set, right?
So do we still need to keep both conditions here?

>
> With the change to allow regular data to be allocated in IOPS groups,
> there might need to be an extra check added here to see what allocation
> phase this is. Should we add *higher* CR_ phases above CR_ANY_FREE to
> allow distinguishing between IOPS->regular fallback and regular->IOPS
> fallback?
>
>
> It seems like most of the complexity/issues here have crept in since the
> addition of the fallback for regular data allocations in IOPS groups...
> I'm not sure if we want to defer that functionality to a separate patch,
> or if you have any suggestions on how to clarify this without adding a
> lot of complexity?
>

I agree that the separation is not clear. I think it would have been
better if we had split that functionality into 2 separate patches.
The 1st patch just adds the functionality that you intended, i.e.

1. metadata allocations should happen via the IOPS groups, and only if there
is no space left in the IOPS groups do they fall back to the non-IOPS groups.
This 1st patch would also have data allocations coming only from non-iops groups.

and the second patch can have details of...
2. adding a knob which can allow users to fallback data allocations to IOPS group too.

If you think you would like to defer the second patch to later to avoid
the complexity, I am fine with that too. The reason is we should still
think about what the fallback criteria for that should be. Should we do it
only when we absolutely have no space in the non-IOPS groups (cr >= CR_ANY_FREE),
or is it ok to fall back even earlier? I guess it will also depend upon
how many groups are IOPS vs. non-IOPS.

I don't think we are keeping that information anywhere on disk, right?
(i.e. the number of IOPS vs. non-IOPS groups). That means we might have to
count them at runtime. Once we have that information, the filesystem can
better decide on when the fallback should happen.

So I agree, we need more discussion and to think it through. I guess Ted
was also suggesting the same on the call. Feel free to defer the
fallback of data allocations to non-IOPS group for a later time (If
we don't have a strong objection from others on this).

Thanks
-ritesh

2023-09-26 19:26:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

On Sep 25, 2023, at 9:35 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
>
> Andreas Dilger <[email protected]> writes:
>
>> On Sep 19, 2023, at 11:39 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
>>>
>>> Bobi Jam <[email protected]> writes:
>>>
>>>> Non-metadata block allocation does not allocate from the IOPS groups
>>>> if non-IOPS groups are not used up.
>>>
>>>> Add for mke2fs an option to mark which blocks are in the IOPS region
>>>> of storage at format time:
>>>>
>>>> -E iops=0-1024G,4096-8192G
>>>>
>>>> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
>>>> group descriptors to decide which groups to allocate dynamic
>>>> filesystem metadata.
>>>>
>>>> Signed-off-by: Bobi Jam <[email protected]
>>>>
>>>> --
>>>> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
>>>> from IOPS groups.
>>>> v1->v2: for metadata block allocation, search in IOPS list then normal
>>>> list.
>>>> ---
>>>>
>
>>>> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>>>> return;
>>>> }
>>>>
>>>> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
>>>> + if (*new_cr == 0)
>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
>>>> + if (!ret && *new_cr < 2)
>>>> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group);
>>
>> It looks like this patch is a bit stale since Ojaswin's renaming of the
>> cr0/cr1 phases to "p2_aligned" and "goal_fast" and "best_avail" names.
>>
>
> Yup. We should rebase our development effort to latest tree.
>
>>> This is a bit confusing here. Say if *new_cr = 0 fails, then we return
>>> ret = false and fallback to choosing xx_iops_group_cr1(). And say if we
>>> were able to find a group which satisfies this allocation request we
>>> return. But the caller never knows that we allocated using cr1 and not
>>> cr0. Because we never updated *new_cr inside xx_iops_group_crX()
>>
>> I guess it is a bit messy, since updating new_cr might interfere with the
>> use of new_cr in the fallthrough to the non-IOPS groups below? In the
>> "1% IOPS groups" case, doing this extra scan for metadata blocks should
>> be very fast, since the metadata allocations are almost always one block
>> (excluding large xattrs), so the only time this would fail is if no IOPS
>> blocks are free, in which case it is fast since the group lists are empty.
>>
>
> What I was suggesting was we never update *new_cr value when we were
> able to find a suitable group. That will confuse the two for loops we
> have in the caller. We might as well just update the *new_cr value once
> we have identified a suitable group in cr0 or cr1 before returning.
>
>> We _could_ have a separate (in effect) cr0_3 and cr0_6 phases before the
>> non-IOPS group allocation starts to be able to distinguish these cases
>> (i.e. skip IOPS group scans if they are full) and the fallthrough search
>> is also having trouble to find a single free block for the metadata, but
>> I think that is pretty unlikely.

I'm not clear on which option you prefer here:
- update *new_cr based on the scan in the IOPS groups (in which case the
fallback to non-IOPS groups would start at a higher CR than necessary)
- add new phases *before* CR_POWER2_ALIGNED, e.g. "CR_IOPS_POWER2_ALIGNED"
"CR_IOPS_CR_GOAL_LEN_FAST" and "CR_IOPS_CR_GOAL_LEN_SLOW" to do either
a fast scan or a slow scan on the IOPS groups, and then fall back to
non-IOPS groups? They would be skipped if no IOPS groups exist.

The second option allows preserving the CR value across the loops, in case
the group returned is not suitable for some reason, without confusing whether
the lookup is being done for IOPS groups or not. Also, it makes sense to
have a "SLOW" pass on the IOPS groups, instead of just "FAST", to ensure
that all of the IOPS groups have been tried. This should be very rare
since most allocations (excluding xattrs) are only one block long.
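
To make the second option concrete, the criteria enum would grow roughly
like the sketch below (names as floated above; the non-IOPS entries are the
renamed phases from the current tree and may not match it exactly):

enum criteria {
	CR_IOPS_POWER2_ALIGNED,		/* fast scan of IOPS groups only */
	CR_IOPS_CR_GOAL_LEN_FAST,	/* goal-length scan of IOPS groups */
	CR_IOPS_CR_GOAL_LEN_SLOW,	/* exhaustive scan of IOPS groups */
	CR_POWER2_ALIGNED,		/* existing non-IOPS phases, unchanged */
	CR_GOAL_LEN_FAST,
	CR_BEST_AVAIL_LEN,
	CR_GOAL_LEN_SLOW,
	CR_ANY_FREE,
};

ext4_mb_choose_next_group() would then start metadata allocations at
CR_IOPS_POWER2_ALIGNED and skip straight to CR_POWER2_ALIGNED when the
filesystem has no IOPS groups, so the existing non-IOPS loop stays untouched.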

Ojaswin, do you have any input here? You've been doing some work on the
mballoc code recently, and it would be good to get this aligned with what
you are doing/planning.

>>>> if (*new_cr == 0) {
>>>> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
>>>> } else if (*new_cr == 1) {
>>>> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
>>>> } else {
>>>> + /*
>>>> + * Cannot get data group from slow storage, try IOPS storage
>>>> + */
>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>>>> + !alloc_metadata && sbi->s_mb_enable_iops_data &&
>>>> + *new_cr == 3) {
>>>> + if (ac->ac_2order)
>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac,
>>>> + group);
>>>> + if (!ret)
>>>> + ext4_mb_choose_next_iops_group_cr1(ac, group);
>>>> + }
>>>
>>> We might never come here in this else case because
>>> should_optimize_scan() which we check in the beginning of this function
>>> will return 0 and we will choose a next linear group for CR >= 2.
>>
>> Hmm, you are right. Just off the bottom of this hunk is a "WARN_ON(1)"
>> that this code block should never be entered.
>
> right.
>
>>
>> Really, the fallback to IOPS groups for regular files should only happen
>> if *new_cr >= CR_GOAL_ANY_FREE. We don't want "normal" block
>> allocation to fill up the IOPS groups just because the filesystem is
>> fragmented and low on space, but only if out of non-IOPS space.
>>
>
> Sure, I have added some comments later on this policy part...
>
>>>>
>>>> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
>>>> goto out;
>>>> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
>>>> goto out;
>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>>>> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
>>>> + !sbi->s_mb_enable_iops_data)
>>>> + goto out;
>>>
>>> Since we already have the grp information here, checking for s_flags is
>>> redundant, right?
>>
>> This is intended to stop regular data allocations in IOPS groups that are
>> found by next_linear_group().
>
> What I meant is EXT4_MB_GRP_TEST_IOPS(grp), will only be true when we
> have sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS as true right?
> So do we still need to keep both conditions here?

Well, EXT2_FLAGS_HAS_IOPS determines whether this functionality is enabled
or not, while the GRP_TEST_IOPS check is for the individual group. So if
the feature is totally disabled (no EXT2_FLAGS_HAS_IOPS) then the per-group
bit should be ignored.
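
If it helps readability, the two checks could be folded into a small helper
(sketch only, the name is made up):

static inline bool ext4_mb_grp_is_iops(struct ext4_sb_info *sbi,
				       struct ext4_group_info *grp)
{
	/* the per-group bit only counts while the feature is enabled */
	return (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) &&
	       EXT4_MB_GRP_TEST_IOPS(grp);
}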

>> With the change to allow regular data to be allocated in IOPS groups,
>> there might need to be an extra check added here to see what allocation
>> phase this is. Should we add *higher* CR_ phases above CR_ANY_FREE to
>> allow distinguishing between IOPS->regular fallback and regular->IOPS
>> fallback?
>>
>>
>> It seems like most of the complexity/issues here have crept in since the
>> addition of the fallback for regular data allocations in IOPS groups...
>> I'm not sure if we want to defer that functionality to a separate patch,
>> or if you have any suggestions on how to clarify this without adding a
>> lot of complexity?
>
> I agree that the separation is not clear. I think it would have been
> better if we would have split that functionality in 2 separate patches.
> The 1st patch just adds the functionality that you intended i.e.
>
> 1. metadata allocations should happen via IOPS group and only if there
> is no space left in IOPS group it will fallback to non-IOPS group.
> This 1st patch also have data allocations coming only from non-iops group.
>
> and the second patch can have details of...
> 2. adding a knob which can allow users to fallback data allocations to IOPS group too.

Sure, I would be happy with that. My main goal is to reserve these flags
and get this feature working in a basic fashion, and then more elaborate
policy decisions can be added once there is a demand for it.

> If you think you would like to defer the second patch to later to avoid
> the complexity, I am fine with that too. The reason is we should still
> think upon what should be the fallback critera for that. Should we do it
> when we absolutely have no space in non-IOPS group (cr >= CR_ANY_FREE)
> or is it ok to fallback even earlier. I guess it will also depend upon
> the information of how many groups are IOPS v/s non-IOPS.
>
> I don't think we are keeping that information anywhere on disk right?
> (no. of IOPS v/s non-IOPS groups). That means we might have to do that
> at runtime. Once we have that information, the filesystem can better
> decide on when should the fallback happen.

Mount already scans all of the groups at mount to set the IOPS flags in
the in-memory group_info, so the count of IOPS groups vs. non-IOPS could
be easily determined, if there is a use for this.
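
Something like the following during the mount-time group scan would be enough
(sketch only; s_iops_group_count and EXT4_MB_GRP_SET_IOPS are illustrative
names, not necessarily what the patch ends up using):

/* while iterating the group descriptors at mount time: */
for (i = 0; i < ngroups; i++) {
    struct ext4_group_desc *desc = ext4_get_group_desc(sb, i, NULL);

    if (desc && (le16_to_cpu(desc->bg_flags) & EXT4_BG_IOPS)) {
        EXT4_MB_GRP_SET_IOPS(ext4_get_group_info(sb, i));
        sbi->s_iops_group_count++;
    }
}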

> So I agree, we need more discussion and think it through. I guess Ted
> was also suggesting the same on the call. Feel free to defer the
> fallback of data allocations to non-IOPS group for a later time (If
> we don't have a strong objection from others on this).

Great, thanks for your review and feedback.


Cheers, Andreas







2023-09-27 03:31:02

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

Andreas Dilger <[email protected]> writes:

> On Sep 25, 2023, at 9:35 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
>>
>> Andreas Dilger <[email protected]> writes:
>>
>>> On Sep 19, 2023, at 11:39 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
>>>>
>>>> Bobi Jam <[email protected]> writes:
>>>>
>>>>> Non-metadata block allocation does not allocate from the IOPS groups
>>>>> if non-IOPS groups are not used up.
>>>>
>>>>> Add for mke2fs an option to mark which blocks are in the IOPS region
>>>>> of storage at format time:
>>>>>
>>>>> -E iops=0-1024G,4096-8192G
>>>>>
>>>>> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
>>>>> group descriptors to decide which groups to allocate dynamic
>>>>> filesystem metadata.
>>>>>
>>>>> Signed-off-by: Bobi Jam <[email protected]
>>>>>
>>>>> --
>>>>> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
>>>>> from IOPS groups.
>>>>> v1->v2: for metadata block allocation, search in IOPS list then normal
>>>>> list.
>>>>> ---
>>>>>
>>
>>>>> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>>>>> return;
>>>>> }
>>>>>
>>>>> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
>>>>> + if (*new_cr == 0)
>>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
>>>>> + if (!ret && *new_cr < 2)
>>>>> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group);
>>>
>>> It looks like this patch is a bit stale since Ojaswin's renaming of the
>>> cr0/cr1 phases to "p2_aligned" and "goal_fast" and "best_avail" names.
>>>
>>
>> Yup. We should rebase our development effort to latest tree.
>>
>>>> This is a bit confusing here. Say if *new_cr = 0 fails, then we return
>>>> ret = false and fallback to choosing xx_iops_group_cr1(). And say if we
>>>> were able to find a group which satisfies this allocation request we
>>>> return. But the caller never knows that we allocated using cr1 and not
>>>> cr0. Because we never updated *new_cr inside xx_iops_group_crX()
>>>
>>> I guess it is a bit messy, since updating new_cr might interfere with the
>>> use of new_cr in the fallthrough to the non-IOPS groups below? In the
>>> "1% IOPS groups" case, doing this extra scan for metadata blocks should
>>> be very fast, since the metadata allocations are almost always one block
>>> (excluding large xattrs), so the only time this would fail is if no IOPS
>>> blocks are free, in which case it is fast since the group lists are empty.
>>>
>>
>> What I was suggesting was we never update *new_cr value when we were
>> able to find a suitable group. That will confuse the two for loops we
>> have in the caller. We might as well just update the *new_cr value once
>> we have identified a suitable group in cr0 or cr1 before returning.
>>
>>> We _could_ have a separate (in effect) cr0_3 and cr0_6 phases before the
>>> non-IOPS group allocation starts to be able to distinguish these cases
>>> (i.e. skip IOPS group scans if they are full) and the fallthrough search
>>> is also having trouble to find a single free block for the metadata, but
>>> I think that is pretty unlikely.
>
> I'm not clear which option you prefer here?
> - update *new_cr based on the scan in the IOPS groups (in which case the
> fallback to non-IOPS groups would start at a higher CR than necessary)

This obviously won't help, since it will slow down the metadata
allocations even further when the IOPS groups become full.


> - add new phases *before* CR_POWER2_ALIGNED, e.g. "CR_IOPS_POWER2_ALIGNED"
> "CR_IOPS_CR_GOAL_LEN_FAST" and "CR_IOPS_CR_GOAL_LEN_SLOW" to do either
> a fast scan or a slow scan on the IOPS groups, and then fall back to
> non-IOPS groups? They would be skipped if no IOPS groups exist.
>
> The second option allows preserving the CR value across the loops, in case
> the group returned is not suitable for some reason, without confusing whether
> the lookup is being done for IOPS groups or not. Also, it makes sense to
> have a "SLOW" pass on the IOPS groups, instead of just "FAST", to ensure
> that all of the IOPS groups have been tried. This should be very rare
> since most allocations (excluding xattrs) are only one block long.

So how about we add a scan_state to the allocation criteria for iops v/s
non-iops scanning? This way we don't add any new crs, but mballoc now
does a stateful scan rather than the stateless scan being done today
(because we now have two different types of groups to select, iops v/s
non-iops).

We can add a function like....
ac->scan_state = ext4_mb_get_scan_state_policy()

...before starting the scan.

For metadata allocations it will return IOPS_SCAN and for data
allocations it can return NON_IOPS_SCAN (more policy decisions can be
added later). This way we don't have to add extra CRs to the enum
criteria. The allocation still happens using the same existing
allocation criteria. If say the allocation using IOPS_SCAN for metadata
fails, we can switch to NON_IOPS_SCAN & restart the scan from cr0 again.
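
i.e. something roughly like this (only a sketch; IOPS_SCAN, NON_IOPS_SCAN and
the ac->scan_state field are all hypothetical names at this point):

enum ext4_mb_scan_state {
    IOPS_SCAN,          /* only consider groups with EXT4_BG_IOPS set */
    NON_IOPS_SCAN,      /* only consider groups without EXT4_BG_IOPS */
};

static enum ext4_mb_scan_state
ext4_mb_get_scan_state_policy(struct ext4_allocation_context *ac)
{
    if (ac->ac_flags & EXT4_MB_HINT_METADATA)
        return IOPS_SCAN;
    return NON_IOPS_SCAN;
}

/*
 * and then in ext4_mb_regular_allocator(), after the existing criteria
 * loop fails to find anything:
 *
 *    if (ac->scan_state == IOPS_SCAN) {
 *        ac->scan_state = NON_IOPS_SCAN;
 *        cr = 0;
 *        goto repeat;
 *    }
 */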

We might also need to maintain some extra state or variable which tells us
which previous scan_state failed to find a suitable group. But I
will leave that as implementation details. It might be useful for data
allocations where we might have a policy to start with either IOPS_SCAN
or NON_IOPS_SCAN first.

I believe this will be a simpler change than adding more crs.
A problem with adding more crs at the beginning could be when we have to
roll over from the NON_IOPS criteria to the IOPS allocation criteria.

Thoughts?

>
> Ojaswin, do you have any input here? You've been doing some work on the
> mballoc code recently, and it would be good to get this aligned with what
> you are doing/planning.
>
>>>>> if (*new_cr == 0) {
>>>>> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
>>>>> } else if (*new_cr == 1) {
>>>>> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
>>>>> } else {
>>>>> + /*
>>>>> + * Cannot get data group from slow storage, try IOPS storage
>>>>> + */
>>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>>>>> + !alloc_metadata && sbi->s_mb_enable_iops_data &&
>>>>> + *new_cr == 3) {
>>>>> + if (ac->ac_2order)
>>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac,
>>>>> + group);
>>>>> + if (!ret)
>>>>> + ext4_mb_choose_next_iops_group_cr1(ac, group);
>>>>> + }
>>>>
>>>> We might never come here in this else case because
>>>> should_optimize_scan() which we check in the beginning of this function
>>>> will return 0 and we will chose a next linear group for CR >= 2.
>>>
>>> Hmm, you are right. Just off the bottom of this hunk is a "WARN_ON(1)"
>>> that this code block should never be entered.
>>
>> right.
>>
>>>
>>> Really, the fallback to IOPS groups for regular files should only happen
>>> in case of if *new_cr >= CR_GOAL_ANY_FREE. We don't want "normal" block
>>> allocation to fill up the IOPS groups just because the filesystem is
>>> fragmented and low on space, but only if out of non-IOPS space.
>>>
>>
>> Sure, I have added some comments later on this policy part...
>>
>>>>>
>>>>> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
>>>>> goto out;
>>>>> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
>>>>> goto out;
>>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
>>>>> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
>>>>> + !sbi->s_mb_enable_iops_data)
>>>>> + goto out;
>>>>
>>>> since we already have a grp information here. Then checking for s_flags
>>>> and is redundant here right?
>>>
>>> This is intended to stop regular data allocations in IOPS groups that are
>>> found by next_linear_group().
>>
>> What I meant is EXT4_MB_GRP_TEST_IOPS(grp), will only be true when we
>> have sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS as true right?
>> So do we still need to keep both conditions here?
>
> Well, EXT2_FLAGS_HAS_IOPS determines whether this functionality is enabled
> or not, while the GRP_TEST_IOPS check is for the individual group. So if
> the feature is totally disabled (no EXT2_FLAGS_HAS_IOPS) then the per-group
> bit should be ignored.
>

Ok, thanks for clarification.

>>> With the change to allow regular data to be allocated in IOPS groups,
>>> there might need to be an extra check added here to see what allocation
>>> phase this is. Should we add *higher* CR_ phases above CR_ANY_FREE to
>>> allow distinguishing between IOPS->regular fallback and regular->IOPS
>>> fallback?
>>>
>>>
>>> It seems like most of the complexity/issues here have crept in since the
>>> addition of the fallback for regular data allocations in IOPS groups...
>>> I'm not sure if we want to defer that functionality to a separate patch,
>>> or if you have any suggestions on how to clarify this without adding a
>>> lot of complexity?
>>
>> I agree that the separation is not clear. I think it would have been
>> better if we would have split that functionality in 2 separate patches.
>> The 1st patch just adds the functionality that you intended i.e.
>>
>> 1. metadata allocations should happen via IOPS group and only if there
>> is no space left in IOPS group it will fallback to non-IOPS group.
>> This 1st patch also have data allocations coming only from non-iops group.
>>
>> and the second patch can have details of...
>> 2. adding a knob which can allow users to fallback data allocations to IOPS group too.
>
> Sure, I would be happy with that. My main goal is to reserve these flags
> and get this feature working in a basic fashion, and then more elaborate
> policy decisions can be added once there is a demand for it.
>

Sure, makes sense.

>> If you think you would like to defer the second patch to later to avoid
>> the complexity, I am fine with that too. The reason is we should still
>> think upon what should be the fallback critera for that. Should we do it
>> when we absolutely have no space in non-IOPS group (cr >= CR_ANY_FREE)
>> or is it ok to fallback even earlier. I guess it will also depend upon
>> the information of how many groups are IOPS v/s non-IOPS.
>>
>> I don't think we are keeping that information anywhere on disk right?
>> (no. of IOPS v/s non-IOPS groups). That means we might have to do that
>> at runtime. Once we have that information, the filesystem can better
>> decide on when should the fallback happen.
>
> Mount already scans all of the groups at mount to set the IOPS flags in
> the in-memory group_info, so the count of IOPS groups vs. non-IOPS could
> be easily determined, if there is a use for this.
>

Yup.


>> So I agree, we need more discussion and think it through. I guess Ted
>> was also suggesting the same on the call. Feel free to defer the
>> fallback of data allocations to non-IOPS group for a later time (If
>> we don't have a strong objection from others on this).
>
> Great, thanks for your review and feedback.

Thanks for helping with the queries.

-ritesh

2023-09-28 14:51:15

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

On Tue, Sep 12, 2023 at 02:59:24PM +0800, Bobi Jam wrote:
> With LVM it is possible to create an LV with SSD storage at the
> beginning of the LV and HDD storage at the end of the LV, and use that
> to separate ext4 metadata allocations (that need small random IOs)
> from data allocations (that are better suited for large sequential
> IOs) depending on the type of underlying storage. Between 0.5-1.0% of
> the filesystem capacity would need to be high-IOPS storage in order to
> hold all of the internal metadata.
>
> This would improve performance for inode and other metadata access,
> such as ls, find, e2fsck, and in general improve file access latency,
> modification, truncate, unlink, transaction commit, etc.
>
> This patch split largest free order group lists and average fragment
> size lists into other two lists for IOPS/fast storage groups, and
> cr 0 / cr 1 group scanning for metadata block allocation in following
> order:
>
> if (allocate metadata blocks)
> if (cr == 0)
> try to find group in largest free order IOPS group list
> if (cr == 1)
> try to find group in fragment size IOPS group list
> if (above two find failed)
> fall through normal group lists as before
> if (allocate data blocks)
> try to find group in normal group lists as before
> if (failed to find group in normal group && mb_enable_iops_data)
> try to find group in IOPS groups
>
> Non-metadata block allocation does not allocate from the IOPS groups
> if non-IOPS groups are not used up.
>
> Add for mke2fs an option to mark which blocks are in the IOPS region
> of storage at format time:
>
> -E iops=0-1024G,4096-8192G
>
> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
> group descriptors to decide which groups to allocate dynamic
> filesystem metadata.
>
> Signed-off-by: Bobi Jam <[email protected]
>
> --
> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
> from IOPS groups.
> v1->v2: for metadata block allocation, search in IOPS list then normal
> list.
> ---

Hi Bobi, Andreas,

So I took a look at this patch and the idea is definitely interesting!
I'll add my review comments inline in a separate mail, but just adding
some high level observations in this mail:

1. Since most of the time our metadata allocations would only request
1 block, we will actually end up skipping CR_POWER2_ALIGNED (aka CR0)
since it only works for len >= 2. But I think it's okay because some
metadata allocations like xattrs might benefit from it.

2. We always try the goal group first in ext4_mb_find_by_goal() before
going through the mballoc criteria, and I don't think there is any logic
to stop that in case the goal group is non-IOPS and metadata is being
allocated. So I think we are relying on the goal-finding logic to give
us IOPS blocks as the goal for metadata, but does it do that currently?
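
If it doesn't, maybe ext4_mb_find_by_goal() needs a check along these lines
(untested sketch, using this patch's helpers; group here is ac->ac_g_ex.fe_group):

/* in ext4_mb_find_by_goal(), before trying the goal group: */
if (EXT4_SB(ac->ac_sb)->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
    (ac->ac_flags & EXT4_MB_HINT_METADATA) &&
    !EXT4_MB_GRP_TEST_IOPS(ext4_get_group_info(ac->ac_sb, group)))
    return 0;   /* let the criteria scan pick an IOPS group instead */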

Thanks!
ojaswin


2023-09-28 18:58:33

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs

On Wed, Sep 27, 2023 at 08:18:28AM +0530, Ritesh Harjani wrote:
> Andreas Dilger <[email protected]> writes:
>
> > On Sep 25, 2023, at 9:35 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
> >>
> >> Andreas Dilger <[email protected]> writes:
> >>
> >>> On Sep 19, 2023, at 11:39 PM, Ritesh Harjani (IBM) <[email protected]> wrote:
> >>>>
> >>>> Bobi Jam <[email protected]> writes:
> >>>>
> >>>>> Non-metadata block allocation does not allocate from the IOPS groups
> >>>>> if non-IOPS groups are not used up.
> >>>>
> >>>>> Add for mke2fs an option to mark which blocks are in the IOPS region
> >>>>> of storage at format time:
> >>>>>
> >>>>> -E iops=0-1024G,4096-8192G
> >>>>>
> >>>>> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
> >>>>> group descriptors to decide which groups to allocate dynamic
> >>>>> filesystem metadata.
> >>>>>
> >>>>> Signed-off-by: Bobi Jam <[email protected]
> >>>>>
> >>>>> --
> >>>>> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
> >>>>> from IOPS groups.
> >>>>> v1->v2: for metadata block allocation, search in IOPS list then normal
> >>>>> list.
> >>>>> ---
> >>>>>
> >>
> >>>>> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> >>>>> return;
> >>>>> }
> >>>>>
> >>>>> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
> >>>>> + if (*new_cr == 0)
> >>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
> >>>>> + if (!ret && *new_cr < 2)
> >>>>> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group);
> >>>
> >>> It looks like this patch is a bit stale since Ojaswin's renaming of the
> >>> cr0/cr1 phases to "p2_aligned" and "goal_fast" and "best_avail" names.
> >>>
> >>
> >> Yup. We should rebase our development effort to latest tree.
> >>
> >>>> This is a bit confusing here. Say if *new_cr = 0 fails, then we return
> >>>> ret = false and fallback to choosing xx_iops_group_cr1(). And say if we
> >>>> were able to find a group which satisfies this allocation request we
> >>>> return. But the caller never knows that we allocated using cr1 and not
> >>>> cr0. Because we never updated *new_cr inside xx_iops_group_crX()
> >>>
> >>> I guess it is a bit messy, since updating new_cr might interfere with the
> >>> use of new_cr in the fallthrough to the non-IOPS groups below? In the
> >>> "1% IOPS groups" case, doing this extra scan for metadata blocks should
> >>> be very fast, since the metadata allocations are almost always one block
> >>> (excluding large xattrs), so the only time this would fail is if no IOPS
> >>> blocks are free, in which case it is fast since the group lists are empty.
> >>>
> >>
> >> What I was suggesting was we never update *new_cr value when we were
> >> able to find a suitable group. That will confuse the two for loops we
> >> have in the caller. We might as well just update the *new_cr value once
> >> we have identified a suitable group in cr0 or cr1 before returning.
> >>
> >>> We _could_ have a separate (in effect) cr0_3 and cr0_6 phases before the
> >>> non-IOPS group allocation starts to be able to distinguish these cases
> >>> (i.e. skip IOPS group scans if they are full) and the fallthrough search
> >>> is also having trouble to find a single free block for the metadata, but
> >>> I think that is pretty unlikely.
> >
> > I'm not clear which option you prefer here?
> > - update *new_cr based on the scan in the IOPS groups (in which case the
> > fallback to non-IOPS groups would start at a higher CR than necessary)
>
> This obviously won't help, since it will slow down the metadata
> allocations even further when the IOPS groups become full.
>
>
> > - add new phases *before* CR_POWER2_ALIGNED, e.g. "CR_IOPS_POWER2_ALIGNED"
> > "CR_IOPS_CR_GOAL_LEN_FAST" and "CR_IOPS_CR_GOAL_LEN_SLOW" to do either
> > a fast scan or a slow scan on the IOPS groups, and then fall back to
> > non-IOPS groups? They would be skipped if no IOPS groups exist.
> >
> > The second option allows preserving the CR value across the loops, in case
> > the group returned is not suitable for some reason, without confusing whether
> > the lookup is being done for IOPS groups or not. Also, it makes sense to
> > have a "SLOW" pass on the IOPS groups, instead of just "FAST", to ensure
> > that all of the IOPS groups have been tried. This should be very rare
> > since most allocations (excluding xattrs) are only one block long.
>
> So how about we add a scan_state to the allocation criteria for iops v/s
> non-iops scanning? This way we don't add any new crs, but mballoc now
> does a stateful scan rather than the stateless scan being done today
> (because we now have two different types of groups to select, iops v/s
> non-iops).
>
> We can add a function like....
> ac->scan_state = ext4_mb_get_scan_state_policy()
>
> ...before starting the scan.
>
> For metadata allocations it will return IOPS_SCAN and for data
> allocations it can return NON_IOPS_SCAN (more policy decisions can be
> added later). This way we don't have to add extra CRs to the enum
> criteria. The allocation still happens using the same existing
> allocation criteria. If say the allocation using IOPS_SCAN for metadata
> fails, we can switch to NON_IOPS_SCAN & restart the scan from cr0 again.
>
> We might also need to maintain some extra state or variable which tells us
> which previous scan_state failed to find a suitable group. But I
> will leave that as implementation details. It might be useful for data
> allocations where we might have a policy to start with either IOPS_SCAN
> or NON_IOPS_SCAN first.
>
> I believe this will be a simpler change than adding more crs.
> A problem with adding more crs at the beginning could be when we have to
> roll over from the NON_IOPS criteria to the IOPS allocation criteria.
>
> Thoughts?
>

Hi Ritesh, Andreas.

So I'm not sure I'm convinced by the way we are handling CRs in the
current implementation. It changes the behavior relative to the other
ext4_mb_choose_*_cr0/1() functions, which work by updating the cr if we fail
to find a suitable group. The new functions instead return true or false
without updating the CR, which, as Ritesh pointed out, leaves some
unhandled edge cases that could cause silent behavior changes.

I like the original way of handling CRs, where the logic to update
the CR is mostly in the ext4_mb_choose_*_cr0/1() functions, and I'd rather not
move it into ext4_mb_choose_next_group() like in this patch.

That being said, @Andreas, my thoughts on the 2 ways you've proposed:

> > - update *new_cr based on the scan in the IOPS groups (in which case the
> > fallback to non-IOPS groups would start at a higher CR than necessary)
>

1. So the way I see it, each CR represents the algorithm used
to find a good group rather than the data structures we use for it.
Hence, I feel that since the algorithm of CR0/1 remains the same whether it's
IOPS or not, we should not add a new CR for IOPS.

> > - update *new_cr based on the scan in the IOPS groups (in which case the
> > fallback to non-IOPS groups would start at a higher CR than necessary)

2. I would actually prefer this way of doing it. I think this is also
somewhat similar to how we were doing it in PATCH v1. We should keep
using the existing ext4_mb_choose_*_cr0/1() functions but update the CR
based on whether it's a metadata allocation or not.
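
i.e. keep the existing helpers and just have them pick the IOPS lists for
metadata, roughly like (sketch only; the list/lock fields are the ones this
patch already adds to ext4_sb_info):

/* inside ext4_mb_choose_next_group_cr0(), sketch only: */
bool iops = (ac->ac_flags & EXT4_MB_HINT_METADATA) &&
            (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS);
struct list_head *orders = iops ? sbi->s_largest_free_orders_list_iops :
                                  sbi->s_mb_largest_free_orders;
rwlock_t *locks = iops ? sbi->s_largest_free_orders_locks_iops :
                         sbi->s_mb_largest_free_orders_locks;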

I actually like Ritesh's proposal of using a scan state, which would then
lead to our pseudo-code looking something like:

ext4_mb_choose_next_group_cr0()
new_cr = 1
ext4_mb_choose_next_group_cr1()
new_cr = 2
/* cr 2 search in outer loop */
if (not found in cr 2) {
    if (ac->scan_state == IOPS_SCAN) {
        ac->scan_state = NON_IOPS_SCAN;
        new_cr = 0;
    } else {
        ac->scan_state = IOPS_SCAN;
        new_cr = 0;
    }
    goto repeat;
}
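
The per-group check during a scan would then be a simple filter, e.g.
(sketch only, reusing the hypothetical scan_state from above):

/* e.g. in ext4_mb_good_group_nolock(): */
if (ac->scan_state == IOPS_SCAN && !EXT4_MB_GRP_TEST_IOPS(grp))
    goto out;
if (ac->scan_state == NON_IOPS_SCAN && EXT4_MB_GRP_TEST_IOPS(grp))
    goto out;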

With the patch I was working on to shift CR2 to an order list [1] and then
have an ext4_mb_choose_next_group_cr2() function instead of using the
linear loop, we can then maintain an IOPS list like we do for cr0/1
here and further simplify the above pseudo-code for CR2.

[1]
https://lore.kernel.org/linux-ext4/[email protected]/

(In the above discussion I used the older cr0/1/2 notation for
simplicity.)

Also, I didn't completely understand this particular statement:

> in which case the fallback to non-IOPS groups would start at a higher
> CR than necessary

I think we can always reset the cr to 0 when we reach the end of the IOPS
allocation, right, like in the pseudo-code above, or am I missing
something?

Regards,
ojaswin

> >
> > Ojaswin, do you have any input here? You've been doing some work on the
> > mballoc code recently, and it would be good to get this aligned with what
> > you are doing/planning.
> >
> >>>>> if (*new_cr == 0) {
> >>>>> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> >>>>> } else if (*new_cr == 1) {
> >>>>> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> >>>>> } else {
> >>>>> + /*
> >>>>> + * Cannot get data group from slow storage, try IOPS storage
> >>>>> + */
> >>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> >>>>> + !alloc_metadata && sbi->s_mb_enable_iops_data &&
> >>>>> + *new_cr == 3) {
> >>>>> + if (ac->ac_2order)
> >>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac,
> >>>>> + group);
> >>>>> + if (!ret)
> >>>>> + ext4_mb_choose_next_iops_group_cr1(ac, group);
> >>>>> + }
> >>>>
> >>>> We might never come here in this else case because
> >>>> should_optimize_scan() which we check in the beginning of this function
> >>>> will return 0 and we will chose a next linear group for CR >= 2.
> >>>
> >>> Hmm, you are right. Just off the bottom of this hunk is a "WARN_ON(1)"
> >>> that this code block should never be entered.
> >>
> >> right.
> >>
> >>>
> >>> Really, the fallback to IOPS groups for regular files should only happen
> >>> in case of if *new_cr >= CR_GOAL_ANY_FREE. We don't want "normal" block
> >>> allocation to fill up the IOPS groups just because the filesystem is
> >>> fragmented and low on space, but only if out of non-IOPS space.
> >>>
> >>
> >> Sure, I have added some comments later on this policy part...
> >>
> >>>>>
> >>>>> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
> >>>>> goto out;
> >>>>> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
> >>>>> goto out;
> >>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> >>>>> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
> >>>>> + !sbi->s_mb_enable_iops_data)
> >>>>> + goto out;
> >>>>
> >>>> since we already have a grp information here. Then checking for s_flags
> >>>> and is redundant here right?
> >>>
> >>> This is intended to stop regular data allocations in IOPS groups that are
> >>> found by next_linear_group().
> >>
> >> What I meant is EXT4_MB_GRP_TEST_IOPS(grp), will only be true when we
> >> have sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS as true right?
> >> So do we still need to keep both conditions here?
> >
> > Well, EXT2_FLAGS_HAS_IOPS determines whether this functionality is enabled
> > or not, while the GRP_TEST_IOPS check is for the individual group. So if
> > the feature is totally disabled (no EXT2_FLAGS_HAS_IOPS) then the per-group
> > bit should be ignored.
> >
>
> Ok, thanks for clarification.
>
> >>> With the change to allow regular data to be allocated in IOPS groups,
> >>> there might need to be an extra check added here to see what allocation
> >>> phase this is. Should we add *higher* CR_ phases above CR_ANY_FREE to
> >>> allow distinguishing between IOPS->regular fallback and regular->IOPS
> >>> fallback?
> >>>
> >>>
> >>> It seems like most of the complexity/issues here have crept in since the
> >>> addition of the fallback for regular data allocations in IOPS groups...
> >>> I'm not sure if we want to defer that functionality to a separate patch,
> >>> or if you have any suggestions on how to clarify this without adding a
> >>> lot of complexity?
> >>
> >> I agree that the separation is not clear. I think it would have been
> >> better if we would have split that functionality in 2 separate patches.
> >> The 1st patch just adds the functionality that you intended i.e.
> >>
> >> 1. metadata allocations should happen via IOPS group and only if there
> >> is no space left in IOPS group it will fallback to non-IOPS group.
> >> This 1st patch also have data allocations coming only from non-iops group.
> >>
> >> and the second patch can have details of...
> >> 2. adding a knob which can allow users to fallback data allocations to IOPS group too.
> >
> > Sure, I would be happy with that. My main goal is to reserve these flags
> > and get this feature working in a basic fashion, and then more elaborate
> > policy decisions can be added once there is a demand for it.
> >
>
> Sure, makes sense.
>
> >> If you think you would like to defer the second patch to later to avoid
> >> the complexity, I am fine with that too. The reason is we should still
> >> think upon what should be the fallback critera for that. Should we do it
> >> when we absolutely have no space in non-IOPS group (cr >= CR_ANY_FREE)
> >> or is it ok to fallback even earlier. I guess it will also depend upon
> >> the information of how many groups are IOPS v/s non-IOPS.
> >>
> >> I don't think we are keeping that information anywhere on disk right?
> >> (no. of IOPS v/s non-IOPS groups). That means we might have to do that
> >> at runtime. Once we have that information, the filesystem can better
> >> decide on when should the fallback happen.
> >
> > Mount already scans all of the groups at mount to set the IOPS flags in
> > the in-memory group_info, so the count of IOPS groups vs. non-IOPS could
> > be easily determined, if there is a use for this.
> >
>
> Yup.
>
>
> >> So I agree, we need more discussion and think it through. I guess Ted
> >> was also suggesting the same on the call. Feel free to defer the
> >> fallback of data allocations to non-IOPS group for a later time (If
> >> we don't have a strong objection from others on this).
> >
> > Great, thanks for your review and feedback.
>
> Thanks for helping with the queries.
>
> -ritesh