2021-04-01 18:06:53

by harshad shirwadkar

[permalink] [raw]
Subject: [PATCH v6 0/7] Block Allocator Improvements

This patch series improves cr 0 and cr 1 passes of the allocator
signficantly. Currently, at cr 0 and 1, we perform linear lookups to
find the matching groups. That's very inefficient for large file
systems where there are millions of block groups. At cr 0, we only
care about the groups that have the largest free order >= the
request's order and at cr 1 we only care about groups where average
fragment size > the request size. so, this patchset introduces new
data structures that allow us to perform cr 0 lookup in constant time
and cr 1 lookup in log (number of groups) time instead of linear.

For cr 0, we add a list for each order and all the groups are enqueued
to the appropriate list based on the largest free order in its buddy
bitmap. This allows us to lookup a match at cr 0 in constant time.

For cr 1, we add a new rb tree of groups sorted by largest fragment
size. This allows us to lookup a match for cr 1 in log (num groups)
time.

These optimizations can be enabled by passing "mb_optimize_scan" mount
option.

These changes may result in allocations to be spread across the block
device. While that would not matter some block devices (such as flash)
it may be a cause of concern for other block devices that benefit from
storing related content togetther such as disk. However, it can be
argued that in high fragmentation scenrio, especially for large disks,
it's still worth optimizing the scanning since in such cases, we get
cpu bound on group scanning instead of getting IO bound. Perhaps, in
future, we could dynamically turn this new optimization on based on
fragmentation levels for such devices.

Verified that there are no regressions in smoke tests (-g quick -c 4k).

Also, to demonstrate the effectiveness for the patch series, following
experiment was performed:

Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. Following command was run consecutively for 3
times:

time dd if=/dev/urandom of=file bs=2M count=10

Here are the results with and without cr 0/1 optimizations:

|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|

The patch [3/6] "ext4: add mballoc stats proc file" is a modified
version of the patch originally written by Artem Blagodarenko
([email protected]). With that patch, I ran following
command with and without optimizations.

dd if=/dev/zero of=/mnt/file bs=2M count=2 conv=fsync

Without optimizations:

useless_c0_loops: 3
useless_c1_loops: 39
useless_c2_loops: 0
useless_c3_loops: 0

With optimizations:

useless_c0_loops: 0
useless_c1_loops: 0
useless_c2_loops: 0
useless_c3_loops: 0

This shows that CR0 and CR1 optimizations get rid of useless CR0 and
CR1 loops altogether thereby significantly reducing the number of
groups that get considered.

Changes from V5:
----------------
- Turned block bitmap prefetching on by default
- Fixed a bug where for cr >= 2, we were skipping first group without
searching in it
- Renamed mb_linear_limit to mb_max_linear_groups

Harshad Shirwadkar (7):
ext4: drop s_mb_bal_lock and convert protected fields to atomic
ext4: add ability to return parsed options from parse_options
ext4: add mballoc stats proc file
ext4: add MB_NUM_ORDERS macro
ext4: improve cr 0 / cr 1 group scanning
ext4: add proc files to monitor new structures
ext4: make prefetch_block_bitmaps default

fs/ext4/ext4.h | 34 ++-
fs/ext4/mballoc.c | 590 +++++++++++++++++++++++++++++++++++++++++++---
fs/ext4/mballoc.h | 22 +-
fs/ext4/super.c | 92 +++++---
fs/ext4/sysfs.c | 6 +
5 files changed, 680 insertions(+), 64 deletions(-)

--
2.31.0.291.g576ba9dcdaf-goog


2021-04-01 18:23:40

by harshad shirwadkar

[permalink] [raw]
Subject: [PATCH v6 2/7] ext4: add ability to return parsed options from parse_options

Before this patch, the function parse_options() was returning
journal_devnum and journal_ioprio variables to the caller. This patch
generalizes that interface to allow parse_options to return any parsed
options to return back to the caller. In this patch series, it gets
used to capture the value of "mb_optimize_scan=%u" mount option.

Signed-off-by: Harshad Shirwadkar <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/super.c | 51 +++++++++++++++++++++++++++++--------------------
1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 071d131fadd8..3a2cd9fb7e73 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2089,9 +2089,14 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
return 1;
}

+struct ext4_parsed_options {
+ unsigned long journal_devnum;
+ unsigned int journal_ioprio;
+};
+
static int handle_mount_opt(struct super_block *sb, char *opt, int token,
- substring_t *args, unsigned long *journal_devnum,
- unsigned int *journal_ioprio, int is_remount)
+ substring_t *args, struct ext4_parsed_options *parsed_opts,
+ int is_remount)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
const struct mount_opts *m;
@@ -2248,7 +2253,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
"Cannot specify journal on remount");
return -1;
}
- *journal_devnum = arg;
+ parsed_opts->journal_devnum = arg;
} else if (token == Opt_journal_path) {
char *journal_path;
struct inode *journal_inode;
@@ -2284,7 +2289,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
return -1;
}

- *journal_devnum = new_encode_dev(journal_inode->i_rdev);
+ parsed_opts->journal_devnum = new_encode_dev(journal_inode->i_rdev);
path_put(&path);
kfree(journal_path);
} else if (token == Opt_journal_ioprio) {
@@ -2293,7 +2298,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
" (must be 0-7)");
return -1;
}
- *journal_ioprio =
+ parsed_opts->journal_ioprio =
IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, arg);
} else if (token == Opt_test_dummy_encryption) {
return ext4_set_test_dummy_encryption(sb, opt, &args[0],
@@ -2410,8 +2415,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}

static int parse_options(char *options, struct super_block *sb,
- unsigned long *journal_devnum,
- unsigned int *journal_ioprio,
+ struct ext4_parsed_options *ret_opts,
int is_remount)
{
struct ext4_sb_info __maybe_unused *sbi = EXT4_SB(sb);
@@ -2431,8 +2435,8 @@ static int parse_options(char *options, struct super_block *sb,
*/
args[0].to = args[0].from = NULL;
token = match_token(p, tokens, args);
- if (handle_mount_opt(sb, p, token, args, journal_devnum,
- journal_ioprio, is_remount) < 0)
+ if (handle_mount_opt(sb, p, token, args, ret_opts,
+ is_remount) < 0)
return 0;
}
#ifdef CONFIG_QUOTA
@@ -4014,7 +4018,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
ext4_fsblk_t sb_block = get_sb_block(&data);
ext4_fsblk_t logical_sb_block;
unsigned long offset = 0;
- unsigned long journal_devnum = 0;
unsigned long def_mount_opts;
struct inode *root;
const char *descr;
@@ -4025,8 +4028,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
int needs_recovery, has_huge_files;
__u64 blocks_count;
int err = 0;
- unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
ext4_group_t first_not_zeroed;
+ struct ext4_parsed_options parsed_opts;
+
+ /* Set defaults for the variables that will be set during parsing */
+ parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+ parsed_opts.journal_devnum = 0;

if ((data && !orig_data) || !sbi)
goto out_free_base;
@@ -4272,8 +4279,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
GFP_KERNEL);
if (!s_mount_opts)
goto failed_mount;
- if (!parse_options(s_mount_opts, sb, &journal_devnum,
- &journal_ioprio, 0)) {
+ if (!parse_options(s_mount_opts, sb, &parsed_opts, 0)) {
ext4_msg(sb, KERN_WARNING,
"failed to parse options in superblock: %s",
s_mount_opts);
@@ -4281,8 +4287,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
kfree(s_mount_opts);
}
sbi->s_def_mount_opt = sbi->s_mount_opt;
- if (!parse_options((char *) data, sb, &journal_devnum,
- &journal_ioprio, 0))
+ if (!parse_options((char *) data, sb, &parsed_opts, 0))
goto failed_mount;

#ifdef CONFIG_UNICODE
@@ -4773,7 +4778,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
* root first: it may be modified in the journal!
*/
if (!test_opt(sb, NOLOAD) && ext4_has_feature_journal(sb)) {
- err = ext4_load_journal(sb, es, journal_devnum);
+ err = ext4_load_journal(sb, es, parsed_opts.journal_devnum);
if (err)
goto failed_mount3a;
} else if (test_opt(sb, NOLOAD) && !sb_rdonly(sb) &&
@@ -4873,7 +4878,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount_wq;
}

- set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
+ set_task_ioprio(sbi->s_journal->j_task, parsed_opts.journal_ioprio);

sbi->s_journal->j_submit_inode_data_buffers =
ext4_journal_submit_inode_data_buffers;
@@ -5808,13 +5813,16 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
struct ext4_mount_options old_opts;
int enable_quota = 0;
ext4_group_t g;
- unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
int err = 0;
#ifdef CONFIG_QUOTA
int i, j;
char *to_free[EXT4_MAXQUOTAS];
#endif
char *orig_data = kstrdup(data, GFP_KERNEL);
+ struct ext4_parsed_options parsed_opts;
+
+ parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+ parsed_opts.journal_devnum = 0;

if (data && !orig_data)
return -ENOMEM;
@@ -5845,7 +5853,8 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
old_opts.s_qf_names[i] = NULL;
#endif
if (sbi->s_journal && sbi->s_journal->j_task->io_context)
- journal_ioprio = sbi->s_journal->j_task->io_context->ioprio;
+ parsed_opts.journal_ioprio =
+ sbi->s_journal->j_task->io_context->ioprio;

/*
* Some options can be enabled by ext4 and/or by VFS mount flag
@@ -5855,7 +5864,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
vfs_flags = SB_LAZYTIME | SB_I_VERSION;
sb->s_flags = (sb->s_flags & ~vfs_flags) | (*flags & vfs_flags);

- if (!parse_options(data, sb, NULL, &journal_ioprio, 1)) {
+ if (!parse_options(data, sb, &parsed_opts, 1)) {
err = -EINVAL;
goto restore_opts;
}
@@ -5905,7 +5914,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)

if (sbi->s_journal) {
ext4_init_journal_params(sb, sbi->s_journal);
- set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
+ set_task_ioprio(sbi->s_journal->j_task, parsed_opts.journal_ioprio);
}

/* Flush outstanding errors before changing fs state */
--
2.31.0.291.g576ba9dcdaf-goog

2021-04-09 16:36:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] Block Allocator Improvements

Thanks, I've applied this patch series into the ext4 git tree.

- Ted

On Thu, Apr 01, 2021 at 10:21:22AM -0700, Harshad Shirwadkar wrote:
> This patch series improves cr 0 and cr 1 passes of the allocator
> signficantly. Currently, at cr 0 and 1, we perform linear lookups to
> find the matching groups. That's very inefficient for large file
> systems where there are millions of block groups. At cr 0, we only
> care about the groups that have the largest free order >= the
> request's order and at cr 1 we only care about groups where average
> fragment size > the request size. so, this patchset introduces new
> data structures that allow us to perform cr 0 lookup in constant time
> and cr 1 lookup in log (number of groups) time instead of linear.
>
> For cr 0, we add a list for each order and all the groups are enqueued
> to the appropriate list based on the largest free order in its buddy
> bitmap. This allows us to lookup a match at cr 0 in constant time.
>
> For cr 1, we add a new rb tree of groups sorted by largest fragment
> size. This allows us to lookup a match for cr 1 in log (num groups)
> time.
>
> These optimizations can be enabled by passing "mb_optimize_scan" mount
> option.
>
> These changes may result in allocations to be spread across the block
> device. While that would not matter some block devices (such as flash)
> it may be a cause of concern for other block devices that benefit from
> storing related content togetther such as disk. However, it can be
> argued that in high fragmentation scenrio, especially for large disks,
> it's still worth optimizing the scanning since in such cases, we get
> cpu bound on group scanning instead of getting IO bound. Perhaps, in
> future, we could dynamically turn this new optimization on based on
> fragmentation levels for such devices.
>
> Verified that there are no regressions in smoke tests (-g quick -c 4k).
>
> Also, to demonstrate the effectiveness for the patch series, following
> experiment was performed:
>
> Created a highly fragmented disk of size 65TB. The disk had no
> contiguous 2M regions. Following command was run consecutively for 3
> times:
>
> time dd if=/dev/urandom of=file bs=2M count=10
>
> Here are the results with and without cr 0/1 optimizations:
>
> |---------+------------------------------+---------------------------|
> | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> |---------+------------------------------+---------------------------|
> | 1st run | 5m1.871s | 2m47.642s |
> | 2nd run | 2m28.390s | 0m0.611s |
> | 3rd run | 2m26.530s | 0m1.255s |
> |---------+------------------------------+---------------------------|
>
> The patch [3/6] "ext4: add mballoc stats proc file" is a modified
> version of the patch originally written by Artem Blagodarenko
> ([email protected]). With that patch, I ran following
> command with and without optimizations.
>
> dd if=/dev/zero of=/mnt/file bs=2M count=2 conv=fsync
>
> Without optimizations:
>
> useless_c0_loops: 3
> useless_c1_loops: 39
> useless_c2_loops: 0
> useless_c3_loops: 0
>
> With optimizations:
>
> useless_c0_loops: 0
> useless_c1_loops: 0
> useless_c2_loops: 0
> useless_c3_loops: 0
>
> This shows that CR0 and CR1 optimizations get rid of useless CR0 and
> CR1 loops altogether thereby significantly reducing the number of
> groups that get considered.
>
> Changes from V5:
> ----------------
> - Turned block bitmap prefetching on by default
> - Fixed a bug where for cr >= 2, we were skipping first group without
> searching in it
> - Renamed mb_linear_limit to mb_max_linear_groups
>
> Harshad Shirwadkar (7):
> ext4: drop s_mb_bal_lock and convert protected fields to atomic
> ext4: add ability to return parsed options from parse_options
> ext4: add mballoc stats proc file
> ext4: add MB_NUM_ORDERS macro
> ext4: improve cr 0 / cr 1 group scanning
> ext4: add proc files to monitor new structures
> ext4: make prefetch_block_bitmaps default
>
> fs/ext4/ext4.h | 34 ++-
> fs/ext4/mballoc.c | 590 +++++++++++++++++++++++++++++++++++++++++++---
> fs/ext4/mballoc.h | 22 +-
> fs/ext4/super.c | 92 +++++---
> fs/ext4/sysfs.c | 6 +
> 5 files changed, 680 insertions(+), 64 deletions(-)
>
> --
> 2.31.0.291.g576ba9dcdaf-goog
>