From: Michal Rostecki <[email protected]>
This patch series adds a new raid1 read policy - roundrobin. For each
request, it selects the mirror whose load is lower than its queue depth.
Load is defined as the number of inflight requests plus a penalty value
(added when the scheduled request is not local to the last request
processed by a rotational disk).
The series consists of preparatory changes which add the necessary
information to the btrfs_device struct, followed by the change which adds
the policy itself.
This policy was tested with fio and compared with the default `pid`
policy.
The singlethreaded test has the following parameters:
[global]
name=btrfs-raid1-seqread
filename=btrfs-raid1-seqread
rw=read
bs=64k
direct=0
numjobs=1
time_based=0
[file1]
size=10G
ioengine=libaio
and shows the following results:
- raid1c3 with 3 HDDs:
3 x Seagate Barracuda ST2000DM008 (2TB)
* pid policy
READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
io=10.0GiB (10.7GB), run=47082-47082msec
* roundrobin policy
READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
io=10.0GiB (10.7GB), run=25028-25028msec
- raid1c3 with 2 HDDs and 1 SSD:
2 x Seagate Barracuda ST2000DM008 (2TB)
1 x Crucial CT256M550SSD1 (256GB)
* pid policy (the worst case when only HDDs were chosen)
READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
io=10.0GiB (10.7GB), run=46577-46577msec
* pid policy (the best case when SSD was used as well)
READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
io=10.0GiB (10.7GB), run=19954-19954msec
* roundrobin (there are no noticeable differences when testing multiple
times)
READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
io=10.0GiB (10.7GB), run=18933-18933msec
The multithreaded test has the following parameters:
[global]
name=btrfs-raid1-seqread
filename=btrfs-raid1-seqread
rw=read
bs=64k
direct=0
numjobs=8
time_based=0
[file1]
size=10G
ioengine=libaio
and shows the following results:
- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)
* pid policy
READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
io=80.0GiB (85.9GB), run=52210-52211msec
* roundrobin
READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
io=80.0GiB (85.9GB), run=47269-47271msec
- raid1c3 with 2 HDDs and 1 SSD:
2 x Seagate Barracuda ST2000DM008 (2TB)
1 x Crucial CT256M550SSD1 (256GB)
* pid policy
READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
io=80.0GiB (85.9GB), run=44449-44450msec
* roundrobin
READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
io=80.0GiB (85.9GB), run=32969-32970msec
To measure the performance of each policy and find optimal penalty
values, I created scripts which are available here:
https://gitlab.com/vadorovsky/btrfs-perf
https://github.com/mrostecki/btrfs-perf
Michal Rostecki (6):
btrfs: Add inflight BIO request counter
btrfs: Store the last device I/O offset
btrfs: Add stripe_physical function
btrfs: Check if the filesystem has mixed types of devices
btrfs: sysfs: Add directory for read policies
btrfs: Add roundrobin raid1 read policy
fs/btrfs/ctree.h | 3 +
fs/btrfs/disk-io.c | 3 +
fs/btrfs/sysfs.c | 144 ++++++++++++++++++++++++++----
fs/btrfs/volumes.c | 218 +++++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/volumes.h | 22 +++++
5 files changed, 366 insertions(+), 24 deletions(-)
--
2.30.0
From: Michal Rostecki <[email protected]>
Add an atomic field which stores the physical offset of the end of the
last I/O operation scheduled to the device. This information is going to
be used to measure the locality of I/O requests.
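The locality check this field enables can be sketched as a small
userspace model (illustrative names, not the kernel code):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Userspace model of the per-device last-offset tracking
 * (illustrative; not the kernel code).
 */
struct dev_model {
	atomic_uint_least64_t last_offset; /* end of the last scheduled I/O */
};

/* Record a scheduled request, as submit_stripe_bio() does. */
static void record_io(struct dev_model *d, uint64_t physical, uint64_t len)
{
	atomic_store(&d->last_offset, physical + len);
}

/* A request is "local" if it starts where the previous one ended. */
static int is_local(struct dev_model *d, uint64_t physical)
{
	return atomic_load(&d->last_offset) == physical;
}
```

A sequential read issued right after a 64KiB request at offset 4096
would be considered local at offset 4096 + 65536.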
Signed-off-by: Michal Rostecki <[email protected]>
---
fs/btrfs/volumes.c | 4 ++++
fs/btrfs/volumes.h | 1 +
2 files changed, 5 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d4f452dcce95..292175206873 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -444,6 +444,7 @@ static struct btrfs_device *__alloc_device(struct btrfs_fs_info *fs_info)
kfree(dev);
return ERR_PTR(-ENOMEM);
}
+ atomic_set(&dev->last_offset, 0);
return dev;
}
@@ -6368,11 +6369,13 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
u64 physical, struct btrfs_device *dev)
{
struct btrfs_fs_info *fs_info = bbio->fs_info;
+ u64 length;
bio->bi_private = bbio;
btrfs_io_bio(bio)->device = dev;
bio->bi_end_io = btrfs_end_bio;
bio->bi_iter.bi_sector = physical >> 9;
+ length = bio->bi_iter.bi_size;
btrfs_debug_in_rcu(fs_info,
"btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
bio_op(bio), bio->bi_opf, bio->bi_iter.bi_sector,
@@ -6382,6 +6385,7 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
btrfs_bio_counter_inc_noblocked(fs_info);
percpu_counter_inc(&dev->inflight);
+ atomic_set(&dev->last_offset, physical + length);
btrfsic_submit_bio(bio);
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 938c5292250c..6e544317a377 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -146,6 +146,7 @@ struct btrfs_device {
/* I/O stats for raid1 mirror selection */
struct percpu_counter inflight;
+ atomic_t last_offset;
};
/*
--
2.30.0
From: Michal Rostecki <[email protected]>
Add a per-CPU inflight BIO counter to btrfs_device which stores the
number of requests currently being processed by the device. This
information is going to be used by the roundrobin raid1 read policy.
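The reason for a per-CPU counter (rather than a single atomic) is that
submissions and completions may run on different CPUs; each CPU updates
its own slot without contention and readers sum all slots, as
percpu_counter_sum() does. A minimal userspace model (illustrative
names, fixed CPU count):

```c
#include <stdatomic.h>

#define NR_CPUS_MODEL 4

/*
 * Userspace model of a per-CPU inflight counter: each CPU updates its
 * own slot; reading sums all slots (cf. percpu_counter_sum()).
 */
struct inflight_model {
	atomic_long counters[NR_CPUS_MODEL];
};

static void inflight_inc(struct inflight_model *c, int cpu)
{
	atomic_fetch_add(&c->counters[cpu % NR_CPUS_MODEL], 1);
}

static void inflight_dec(struct inflight_model *c, int cpu)
{
	atomic_fetch_sub(&c->counters[cpu % NR_CPUS_MODEL], 1);
}

static long inflight_sum(struct inflight_model *c)
{
	long sum = 0;

	for (int i = 0; i < NR_CPUS_MODEL; i++)
		sum += atomic_load(&c->counters[i]);
	return sum;
}
```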
Signed-off-by: Michal Rostecki <[email protected]>
---
fs/btrfs/volumes.c | 11 +++++++++--
fs/btrfs/volumes.h | 3 +++
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3948f5b50d11..d4f452dcce95 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -376,6 +376,7 @@ void btrfs_free_device(struct btrfs_device *device)
extent_io_tree_release(&device->alloc_state);
bio_put(device->flush_bio);
btrfs_destroy_dev_zone_info(device);
+ percpu_counter_destroy(&device->inflight);
kfree(device);
}
@@ -439,6 +440,11 @@ static struct btrfs_device *__alloc_device(struct btrfs_fs_info *fs_info)
extent_io_tree_init(fs_info, &dev->alloc_state,
IO_TREE_DEVICE_ALLOC_STATE, NULL);
+ if (percpu_counter_init(&dev->inflight, 0, GFP_KERNEL)) {
+ kfree(dev);
+ return ERR_PTR(-ENOMEM);
+ }
+
return dev;
}
@@ -6305,6 +6311,7 @@ static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio)
static void btrfs_end_bio(struct bio *bio)
{
+ struct btrfs_device *dev = btrfs_io_bio(bio)->device;
struct btrfs_bio *bbio = bio->bi_private;
int is_orig_bio = 0;
@@ -6312,8 +6319,6 @@ static void btrfs_end_bio(struct bio *bio)
atomic_inc(&bbio->error);
if (bio->bi_status == BLK_STS_IOERR ||
bio->bi_status == BLK_STS_TARGET) {
- struct btrfs_device *dev = btrfs_io_bio(bio)->device;
-
ASSERT(dev->bdev);
if (bio_op(bio) == REQ_OP_WRITE)
btrfs_dev_stat_inc_and_print(dev,
@@ -6331,6 +6336,7 @@ static void btrfs_end_bio(struct bio *bio)
is_orig_bio = 1;
btrfs_bio_counter_dec(bbio->fs_info);
+ percpu_counter_dec(&dev->inflight);
if (atomic_dec_and_test(&bbio->stripes_pending)) {
if (!is_orig_bio) {
@@ -6375,6 +6381,7 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
bio_set_dev(bio, dev->bdev);
btrfs_bio_counter_inc_noblocked(fs_info);
+ percpu_counter_inc(&dev->inflight);
btrfsic_submit_bio(bio);
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 04e2b26823c2..938c5292250c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -143,6 +143,9 @@ struct btrfs_device {
struct completion kobj_unregister;
/* For sysfs/FSID/devinfo/devid/ */
struct kobject devid_kobj;
+
+ /* I/O stats for raid1 mirror selection */
+ struct percpu_counter inflight;
};
/*
--
2.30.0
From: Michal Rostecki <[email protected]>
Add a new raid1 read policy `roundrobin`. For each read request, it
selects the mirror whose load is lower than its queue depth, iterating
from the mirror last used by the current CPU. Load is defined as the
number of inflight requests plus a potential penalty value.
The policy can be enabled through sysfs:
# echo roundrobin > /sys/fs/btrfs/[fsid]/read_policies/policy
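The wrap-around scan over the mirrors can be modeled in userspace as the
following sketch (illustrative code, not the kernel implementation;
`can_take()` stands in for the queue-depth check):

```c
/*
 * Userspace model of find_live_mirror_roundrobin(): scan from the
 * mirror last used on this CPU to the end of the stripe range, then
 * wrap around to the beginning; when every queue is full, fall back
 * to the last used mirror.
 */
static int pick_mirror(int last, int first, int num_stripes,
		       int (*can_take)(int mirror))
{
	int i;

	for (i = last; i < first + num_stripes; i++)
		if (can_take(i))
			return i;
	for (i = first; i < last; i++)
		if (can_take(i))
			return i;
	return last;
}
```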
This policy was tested with fio and compared with the default `pid`
policy.
The singlethreaded test has the following parameters:
[global]
name=btrfs-raid1-seqread
filename=btrfs-raid1-seqread
rw=read
bs=64k
direct=0
numjobs=1
time_based=0
[file1]
size=10G
ioengine=libaio
and shows the following results:
- raid1c3 with 3 HDDs:
3 x Seagate Barracuda ST2000DM008 (2TB)
* pid policy
READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
io=10.0GiB (10.7GB), run=47082-47082msec
* roundrobin policy
READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
io=10.0GiB (10.7GB), run=25028-25028msec
- raid1c3 with 2 HDDs and 1 SSD:
2 x Seagate Barracuda ST2000DM008 (2TB)
1 x Crucial CT256M550SSD1 (256GB)
* pid policy (the worst case when only HDDs were chosen)
READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
io=10.0GiB (10.7GB), run=46577-46577msec
* pid policy (the best case when SSD was used as well)
READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
io=10.0GiB (10.7GB), run=19954-19954msec
* roundrobin (there are no noticeable differences when testing multiple
times)
READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
io=10.0GiB (10.7GB), run=18933-18933msec
The multithreaded test has the following parameters:
[global]
name=btrfs-raid1-seqread
filename=btrfs-raid1-seqread
rw=read
bs=64k
direct=0
numjobs=8
time_based=0
[file1]
size=10G
ioengine=libaio
and shows the following results:
- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)
* pid policy
READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
io=80.0GiB (85.9GB), run=52210-52211msec
* roundrobin
READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
io=80.0GiB (85.9GB), run=47269-47271msec
- raid1c3 with 2 HDDs and 1 SSD:
2 x Seagate Barracuda ST2000DM008 (2TB)
1 x Crucial CT256M550SSD1 (256GB)
* pid policy
READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
io=80.0GiB (85.9GB), run=44449-44450msec
* roundrobin
READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
io=80.0GiB (85.9GB), run=32969-32970msec
The penalty is an additional value added to the number of inflight
requests when a scheduled request is non-local (which means it would
start at a different physical location than where the last request
processed by the given device ended). By default, it is applied only on
filesystems which have mixed types of devices (non-rotational and
rotational), but it can be configured to be applied unconditionally.
The configuration is done through sysfs:
- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only
where the value 1 (the default) means applying the penalty only in mixed
arrays, and 0 means applying it unconditionally.
The exact penalty value is defined separately for non-rotational and
rotational devices. By default, it's 0 for non-rotational devices and 1
for rotational devices. Both values are configurable through sysfs:
- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc
To sum it up - the default case is applying the penalty under the
following conditions:
- the raid1 array consists of mixed types of devices
- the scheduled request is going to be non-local for the given disk
- the device is rotational
That default case is based on a slight preference towards non-rotational
disks in mixed arrays and has proven to give the best performance in
tested arrays.
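The conditions above can be condensed into a small userspace model of
the load calculation (illustrative names; `local` stands in for the
last-offset comparison):

```c
/*
 * Userspace model of mirror_load(): the load is the number of inflight
 * requests, plus a per-device-type penalty when the request is
 * non-local, gated on the mixed-array condition.
 */
struct policy_cfg {
	int mixed_only;          /* roundrobin_nonlocal_inc_mixed_only */
	unsigned int rot_inc;    /* penalty for rotational devices */
	unsigned int nonrot_inc; /* penalty for non-rotational devices */
};

static long model_load(const struct policy_cfg *cfg, int fs_mixed,
		       int dev_rotating, long inflight, int local)
{
	if ((!cfg->mixed_only || fs_mixed) && !local)
		return inflight +
		       (dev_rotating ? cfg->rot_inc : cfg->nonrot_inc);
	return inflight;
}
```

With the defaults, only a non-local request to a rotational device in a
mixed array sees its load incremented.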
For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
(429MB/s). Adding a penalty value of 1 dropped the performance to
404MiB/s (424MB/s), and increasing the value towards 10 made it even
worse.
For the array with 2 HDDs and 1 SSD, adding a penalty value of 1 to
rotational disks resulted in the best performance - 541MiB/s (567MB/s).
Both omitting the penalty and increasing it made the performance worse.
Adding a penalty value to non-rotational disks always decreased the
performance, which motivated setting it to 0 by default. For the purpose
of testing, it is still configurable.
To measure the performance of each policy and find optimal penalty
values, I created scripts which are available here:
https://gitlab.com/vadorovsky/btrfs-perf
https://github.com/mrostecki/btrfs-perf
Signed-off-by: Michal Rostecki <[email protected]>
---
fs/btrfs/ctree.h | 3 +
fs/btrfs/disk-io.c | 3 +
fs/btrfs/sysfs.c | 93 +++++++++++++++++++++++++++++-
fs/btrfs/volumes.c | 137 ++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 10 ++++
5 files changed, 242 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a9b0521d9e89..6ff0a18fd219 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -976,6 +976,9 @@ struct btrfs_fs_info {
/* Max size to emit ZONE_APPEND write command */
u64 max_zone_append_size;
+ /* Last mirror picked in round-robin selection */
+ int __percpu *last_mirror;
+
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
spinlock_t ref_verify_lock;
struct rb_root block_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 71fab77873a5..937fcadbdd2f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1547,6 +1547,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
btrfs_extent_buffer_leak_debug_check(fs_info);
kfree(fs_info->super_copy);
kfree(fs_info->super_for_commit);
+ free_percpu(fs_info->last_mirror);
kvfree(fs_info);
}
@@ -2857,6 +2858,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
fs_info->swapfile_pins = RB_ROOT;
fs_info->send_in_progress = 0;
+
+ fs_info->last_mirror = alloc_percpu(int);
}
static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a8f528eb4e50..b9a6d38843ef 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -928,7 +928,7 @@ static bool strmatch(const char *buffer, const char *string)
return false;
}
-static const char * const btrfs_read_policy_name[] = { "pid" };
+static const char * const btrfs_read_policy_name[] = { "pid", "roundrobin" };
static ssize_t btrfs_read_policy_show(struct kobject *kobj,
struct kobj_attribute *a, char *buf)
@@ -976,8 +976,99 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
}
BTRFS_ATTR_RW(read_policies, policy, btrfs_read_policy_show, btrfs_read_policy_store);
+static ssize_t btrfs_roundrobin_nonlocal_inc_mixed_only_show(
+ struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+ struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n",
+ READ_ONCE(fs_devices->roundrobin_nonlocal_inc_mixed_only));
+}
+
+static ssize_t btrfs_roundrobin_nonlocal_inc_mixed_only_store(
+ struct kobject *kobj, struct kobj_attribute *a, const char *buf,
+ size_t len)
+{
+ struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
+ bool val;
+ int ret;
+
+ ret = kstrtobool(buf, &val);
+ if (ret)
+ return -EINVAL;
+
+ WRITE_ONCE(fs_devices->roundrobin_nonlocal_inc_mixed_only, val);
+ return len;
+}
+BTRFS_ATTR_RW(read_policies, roundrobin_nonlocal_inc_mixed_only,
+ btrfs_roundrobin_nonlocal_inc_mixed_only_show,
+ btrfs_roundrobin_nonlocal_inc_mixed_only_store);
+
+static ssize_t btrfs_roundrobin_nonrot_nonlocal_inc_show(struct kobject *kobj,
+ struct kobj_attribute *a,
+ char *buf)
+{
+ struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n",
+ READ_ONCE(fs_devices->roundrobin_nonrot_nonlocal_inc));
+}
+
+static ssize_t btrfs_roundrobin_nonrot_nonlocal_inc_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf,
+ size_t len)
+{
+ struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
+ u32 val;
+ int ret;
+
+ ret = kstrtou32(buf, 10, &val);
+ if (ret)
+ return -EINVAL;
+
+ WRITE_ONCE(fs_devices->roundrobin_nonrot_nonlocal_inc, val);
+ return len;
+}
+BTRFS_ATTR_RW(read_policies, roundrobin_nonrot_nonlocal_inc,
+ btrfs_roundrobin_nonrot_nonlocal_inc_show,
+ btrfs_roundrobin_nonrot_nonlocal_inc_store);
+
+static ssize_t btrfs_roundrobin_rot_nonlocal_inc_show(struct kobject *kobj,
+ struct kobj_attribute *a,
+ char *buf)
+{
+ struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n",
+ READ_ONCE(fs_devices->roundrobin_rot_nonlocal_inc));
+}
+
+static ssize_t btrfs_roundrobin_rot_nonlocal_inc_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf,
+ size_t len)
+{
+ struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
+ u32 val;
+ int ret;
+
+ ret = kstrtou32(buf, 10, &val);
+ if (ret)
+ return -EINVAL;
+
+ WRITE_ONCE(fs_devices->roundrobin_rot_nonlocal_inc, val);
+ return len;
+}
+BTRFS_ATTR_RW(read_policies, roundrobin_rot_nonlocal_inc,
+ btrfs_roundrobin_rot_nonlocal_inc_show,
+ btrfs_roundrobin_rot_nonlocal_inc_store);
+
static const struct attribute *read_policies_attrs[] = {
BTRFS_ATTR_PTR(read_policies, policy),
+ BTRFS_ATTR_PTR(read_policies, roundrobin_nonlocal_inc_mixed_only),
+ BTRFS_ATTR_PTR(read_policies, roundrobin_nonrot_nonlocal_inc),
+ BTRFS_ATTR_PTR(read_policies, roundrobin_rot_nonlocal_inc),
NULL,
};
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1ad30a595722..c6dd393190f6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1265,6 +1265,11 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
fs_devices->total_rw_bytes = 0;
fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_REGULAR;
fs_devices->read_policy = BTRFS_READ_POLICY_PID;
+ fs_devices->roundrobin_nonlocal_inc_mixed_only = true;
+ fs_devices->roundrobin_nonrot_nonlocal_inc =
+ BTRFS_DEFAULT_ROUNDROBIN_NONROT_NONLOCAL_INC;
+ fs_devices->roundrobin_rot_nonlocal_inc =
+ BTRFS_DEFAULT_ROUNDROBIN_ROT_NONLOCAL_INC;
return 0;
}
@@ -5555,8 +5560,125 @@ static u64 stripe_physical(struct map_lookup *map, u32 stripe_index,
stripe_nr * map->stripe_len;
}
+/*
+ * Calculates the load of the given mirror. Load is defined as the number of
+ * inflight requests + a potential penalty value.
+ *
+ * @fs_info: the filesystem
+ * @map: mapping containing the logical extent
+ * @mirror_index: index of the mirror to check
+ * @stripe_offset: offset of the block in its stripe
+ * @stripe_nr: index of the stripe where the block falls in
+ */
+static int mirror_load(struct btrfs_fs_info *fs_info, struct map_lookup *map,
+ int mirror_index, u64 stripe_offset, u64 stripe_nr)
+{
+ struct btrfs_fs_devices *fs_devices;
+ struct btrfs_device *dev;
+ int last_offset;
+ u64 physical;
+ int load;
+
+ dev = map->stripes[mirror_index].dev;
+ load = percpu_counter_sum(&dev->inflight);
+ last_offset = atomic_read(&dev->last_offset);
+ physical = stripe_physical(map, mirror_index, stripe_offset, stripe_nr);
+
+ fs_devices = fs_info->fs_devices;
+
+ /*
+ * If the filesystem has mixed type of devices (or we enable adding a
+ * penalty value regardless) and the request is non-local, add a
+ * penalty value.
+ */
+ if ((!fs_devices->roundrobin_nonlocal_inc_mixed_only ||
+ fs_devices->mixed) && last_offset != physical) {
+ if (dev->rotating)
+ return load + fs_devices->roundrobin_rot_nonlocal_inc;
+ return load + fs_devices->roundrobin_nonrot_nonlocal_inc;
+ }
+
+ return load;
+}
+
+/*
+ * Checks if the given mirror can process more requests.
+ *
+ * @fs_info: the filesystem
+ * @map: mapping containing the logical extent
+ * @mirror_index: index of the mirror to check
+ * @stripe_offset: offset of the block in its stripe
+ * @stripe_nr: index of the stripe where the block falls in
+ *
+ * Returns true if more requests can be processed, otherwise returns false.
+ */
+static bool mirror_queue_not_filled(struct btrfs_fs_info *fs_info,
+ struct map_lookup *map, int mirror_index,
+ u64 stripe_offset, u64 stripe_nr)
+{
+ struct block_device *bdev;
+ unsigned int queue_depth;
+ int inflight;
+
+ bdev = map->stripes[mirror_index].dev->bdev;
+ inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
+ stripe_nr);
+ queue_depth = blk_queue_depth(bdev->bd_disk->queue);
+
+ return inflight < queue_depth;
+}
+
+/*
+ * Find a mirror using the round-robin technique which has lower load than
+ * queue depth. Load is defined as the number of inflight requests + potential
+ * penalty value.
+ *
+ * @fs_info: the filesystem
+ * @map: mapping containing the logical extent
+ * @first: index of the first device in the stripes array
+ * @num_stripes: number of stripes in the stripes array
+ * @stripe_offset: offset of the block in its stripe
+ * @stripe_nr: index of the stripe where the block falls in
+ *
+ * Returns the index of selected mirror.
+ */
+static int find_live_mirror_roundrobin(struct btrfs_fs_info *fs_info,
+ struct map_lookup *map, int first,
+ int num_stripes, u64 stripe_offset,
+ u64 stripe_nr)
+{
+ int preferred_mirror;
+ int last_mirror;
+ int i;
+
+ last_mirror = this_cpu_read(*fs_info->last_mirror);
+
+ for (i = last_mirror; i < first + num_stripes; i++) {
+ if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
+ stripe_nr)) {
+ preferred_mirror = i;
+ goto out;
+ }
+ }
+
+ for (i = first; i < last_mirror; i++) {
+ if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
+ stripe_nr)) {
+ preferred_mirror = i;
+ goto out;
+ }
+ }
+
+ preferred_mirror = last_mirror;
+
+out:
+ this_cpu_write(*fs_info->last_mirror, preferred_mirror);
+ return preferred_mirror;
+}
+
static int find_live_mirror(struct btrfs_fs_info *fs_info,
struct map_lookup *map, int first,
+ u64 stripe_offset, u64 stripe_nr,
int dev_replace_is_ongoing)
{
int i;
@@ -5584,6 +5706,11 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
case BTRFS_READ_POLICY_PID:
preferred_mirror = first + (current->pid % num_stripes);
break;
+ case BTRFS_READ_POLICY_ROUNDROBIN:
+ preferred_mirror = find_live_mirror_roundrobin(
+ fs_info, map, first, num_stripes, stripe_offset,
+ stripe_nr);
+ break;
}
if (dev_replace_is_ongoing &&
@@ -6178,7 +6305,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
stripe_index = mirror_num - 1;
else {
stripe_index = find_live_mirror(fs_info, map, 0,
- dev_replace_is_ongoing);
+ stripe_offset,
+ stripe_nr,
+ dev_replace_is_ongoing);
mirror_num = stripe_index + 1;
}
@@ -6204,8 +6333,10 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
else {
int old_stripe_index = stripe_index;
stripe_index = find_live_mirror(fs_info, map,
- stripe_index,
- dev_replace_is_ongoing);
+ stripe_index,
+ stripe_offset,
+ stripe_nr,
+ dev_replace_is_ongoing);
mirror_num = stripe_index - old_stripe_index + 1;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ee050fd48042..47ca47b60ea9 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -230,9 +230,15 @@ enum btrfs_chunk_allocation_policy {
enum btrfs_read_policy {
/* Use process PID to choose the stripe */
BTRFS_READ_POLICY_PID,
+ /* Round robin */
+ BTRFS_READ_POLICY_ROUNDROBIN,
BTRFS_NR_READ_POLICY,
};
+/* Default raid1 policies config */
+#define BTRFS_DEFAULT_ROUNDROBIN_NONROT_NONLOCAL_INC 0
+#define BTRFS_DEFAULT_ROUNDROBIN_ROT_NONLOCAL_INC 1
+
struct btrfs_fs_devices {
u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
u8 metadata_uuid[BTRFS_FSID_SIZE];
@@ -294,6 +300,10 @@ struct btrfs_fs_devices {
/* Policy used to read the mirrored stripes */
enum btrfs_read_policy read_policy;
+ /* Policies config */
+ bool roundrobin_nonlocal_inc_mixed_only;
+ u32 roundrobin_nonrot_nonlocal_inc;
+ u32 roundrobin_rot_nonlocal_inc;
};
#define BTRFS_BIO_INLINE_CSUM_SIZE 64
--
2.30.0
From: Michal Rostecki <[email protected]>
Before this change, the raid1 read policy could be selected through the
/sys/fs/btrfs/[fsid]/read_policy file.
Change it to /sys/fs/btrfs/[fsid]/read_policies/policy.
The motivation behind creating the read_policies directory is that the
next changes and new read policies are going to introduce settings
specific to read policies.
Signed-off-by: Michal Rostecki <[email protected]>
---
fs/btrfs/sysfs.c | 51 +++++++++++++++++++++++++++++++++-------------
fs/btrfs/volumes.h | 1 +
2 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 19b9fffa2c9c..a8f528eb4e50 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -896,6 +896,19 @@ static ssize_t btrfs_generation_show(struct kobject *kobj,
}
BTRFS_ATTR(, generation, btrfs_generation_show);
+static const struct attribute *btrfs_attrs[] = {
+ BTRFS_ATTR_PTR(, label),
+ BTRFS_ATTR_PTR(, nodesize),
+ BTRFS_ATTR_PTR(, sectorsize),
+ BTRFS_ATTR_PTR(, clone_alignment),
+ BTRFS_ATTR_PTR(, quota_override),
+ BTRFS_ATTR_PTR(, metadata_uuid),
+ BTRFS_ATTR_PTR(, checksum),
+ BTRFS_ATTR_PTR(, exclusive_operation),
+ BTRFS_ATTR_PTR(, generation),
+ NULL,
+};
+
/*
* Look for an exact string @string in @buffer with possible leading or
* trailing whitespace
@@ -920,7 +933,7 @@ static const char * const btrfs_read_policy_name[] = { "pid" };
static ssize_t btrfs_read_policy_show(struct kobject *kobj,
struct kobj_attribute *a, char *buf)
{
- struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj);
+ struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
ssize_t ret = 0;
int i;
@@ -944,7 +957,7 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
struct kobj_attribute *a,
const char *buf, size_t len)
{
- struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj);
+ struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
int i;
for (i = 0; i < BTRFS_NR_READ_POLICY; i++) {
@@ -961,19 +974,10 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
return -EINVAL;
}
-BTRFS_ATTR_RW(, read_policy, btrfs_read_policy_show, btrfs_read_policy_store);
+BTRFS_ATTR_RW(read_policies, policy, btrfs_read_policy_show, btrfs_read_policy_store);
-static const struct attribute *btrfs_attrs[] = {
- BTRFS_ATTR_PTR(, label),
- BTRFS_ATTR_PTR(, nodesize),
- BTRFS_ATTR_PTR(, sectorsize),
- BTRFS_ATTR_PTR(, clone_alignment),
- BTRFS_ATTR_PTR(, quota_override),
- BTRFS_ATTR_PTR(, metadata_uuid),
- BTRFS_ATTR_PTR(, checksum),
- BTRFS_ATTR_PTR(, exclusive_operation),
- BTRFS_ATTR_PTR(, generation),
- BTRFS_ATTR_PTR(, read_policy),
+static const struct attribute *read_policies_attrs[] = {
+ BTRFS_ATTR_PTR(read_policies, policy),
NULL,
};
@@ -1112,6 +1116,12 @@ void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info)
sysfs_remove_link(fsid_kobj, "bdi");
+ if (fs_info->fs_devices->read_policies_kobj) {
+ sysfs_remove_files(fs_info->fs_devices->read_policies_kobj,
+ read_policies_attrs);
+ kobject_del(fs_info->fs_devices->read_policies_kobj);
+ kobject_put(fs_info->fs_devices->read_policies_kobj);
+ }
if (fs_info->space_info_kobj) {
sysfs_remove_files(fs_info->space_info_kobj, allocation_attrs);
kobject_del(fs_info->space_info_kobj);
@@ -1658,6 +1668,19 @@ int btrfs_sysfs_add_mounted(struct btrfs_fs_info *fs_info)
if (error)
goto failure;
+ fs_devs->read_policies_kobj = kobject_create_and_add("read_policies",
+ fsid_kobj);
+
+ if (!fs_devs->read_policies_kobj) {
+ error = -ENOMEM;
+ goto failure;
+ }
+
+ error = sysfs_create_files(fs_devs->read_policies_kobj,
+ read_policies_attrs);
+ if (error)
+ goto failure;
+
return 0;
failure:
btrfs_sysfs_remove_mounted(fs_info);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 594f1207281c..ee050fd48042 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -287,6 +287,7 @@ struct btrfs_fs_devices {
struct kobject fsid_kobj;
struct kobject *devices_kobj;
struct kobject *devinfo_kobj;
+ struct kobject *read_policies_kobj;
struct completion kobj_unregister;
enum btrfs_chunk_allocation_policy chunk_alloc_policy;
--
2.30.0
From: Michal Rostecki <[email protected]>
Move the calculation of the physical address for a stripe to a new
function - stripe_physical(). It can be used by raid1 read policies to
calculate the offset and select mirrors based on I/O locality.
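The calculation itself is simple enough to restate as a standalone
userspace sketch (illustrative parameter names):

```c
#include <stdint.h>

/*
 * Userspace model of stripe_physical(): the physical address is the
 * stripe's base address, plus the offset of the block within its
 * stripe, plus the full stripes already covered.
 */
static uint64_t model_stripe_physical(uint64_t stripe_base,
				      uint64_t stripe_offset,
				      uint64_t stripe_nr,
				      uint64_t stripe_len)
{
	return stripe_base + stripe_offset + stripe_nr * stripe_len;
}
```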
Signed-off-by: Michal Rostecki <[email protected]>
---
fs/btrfs/volumes.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 292175206873..1ac364a2f105 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5498,6 +5498,23 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
return ret;
}
+/*
+ * Calculates the physical location for the given stripe and I/O geometry.
+ *
+ * @map: mapping containing the logical extent
+ * @stripe_index: index of the stripe to make a calculation for
+ * @stripe_offset: offset of the block in its stripe
+ * @stripe_nr: index of the stripe where the block falls in
+ *
+ * Returns the physical location.
+ */
+static u64 stripe_physical(struct map_lookup *map, u32 stripe_index,
+ u64 stripe_offset, u64 stripe_nr)
+{
+ return map->stripes[stripe_index].physical + stripe_offset +
+ stripe_nr * map->stripe_len;
+}
+
static int find_live_mirror(struct btrfs_fs_info *fs_info,
struct map_lookup *map, int first,
int dev_replace_is_ongoing)
@@ -6216,8 +6233,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
}
for (i = 0; i < num_stripes; i++) {
- bbio->stripes[i].physical = map->stripes[stripe_index].physical +
- stripe_offset + stripe_nr * map->stripe_len;
+ bbio->stripes[i].physical = stripe_physical(map, stripe_index,
+ stripe_offset,
+ stripe_nr);
bbio->stripes[i].dev = map->stripes[stripe_index].dev;
stripe_index++;
}
--
2.30.0
From: Michal Rostecki <[email protected]>
Add the btrfs_check_mixed() function which checks if the filesystem has
mixed types of devices (non-rotational and rotational). This information
is going to be used by the roundrobin raid1 read policy.
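Since the callers only invoke the check while the array is not yet
marked mixed, the existing devices can be assumed homogeneous, and the
logic reduces to comparing the new device's flag against them. A
userspace model under that assumption (illustrative names):

```c
#include <stdbool.h>

/*
 * Userspace model of btrfs_check_mixed(): given the rotational flags
 * of the devices already in the filesystem (assumed homogeneous, since
 * the caller only checks while the array is not yet marked mixed), the
 * array becomes mixed when the new device's flag differs from any of
 * them.
 */
static bool model_check_mixed(const bool *rotating, int n, bool new_rot)
{
	for (int i = 0; i < n; i++)
		if (rotating[i] != new_rot)
			return true;
	return false;
}
```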
Signed-off-by: Michal Rostecki <[email protected]>
---
fs/btrfs/volumes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/volumes.h | 7 +++++++
2 files changed, 49 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1ac364a2f105..1ad30a595722 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -617,6 +617,35 @@ static int btrfs_free_stale_devices(const char *path,
return ret;
}
+/*
+ * Checks if after adding the new device the filesystem is going to have mixed
+ * types of devices (non-rotational and rotational).
+ *
+ * @fs_devices: list of devices
+ * @new_device_rotating: whether the new device is rotational
+ *
+ * Returns true if there are mixed types of devices, otherwise returns false.
+ */
+static bool btrfs_check_mixed(struct btrfs_fs_devices *fs_devices,
+ bool new_device_rotating)
+{
+ struct btrfs_device *device, *prev_device = NULL;
+
+ list_for_each_entry(device, &fs_devices->devices, dev_list) {
+ if (prev_device == NULL &&
+ device->rotating != new_device_rotating)
+ return true;
+ if (prev_device != NULL &&
+ (device->rotating != prev_device->rotating ||
+ device->rotating != new_device_rotating))
+ return true;
+
+ prev_device = device;
+ }
+
+ return false;
+}
+
/*
* This is only used on mount, and we are protected from competing things
* messing with our fs_devices by the uuid_mutex, thus we do not need the
@@ -629,6 +658,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
struct request_queue *q;
struct block_device *bdev;
struct btrfs_super_block *disk_super;
+ bool rotating;
u64 devid;
int ret;
@@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
}
q = bdev_get_queue(bdev);
- if (!blk_queue_nonrot(q))
+ rotating = !blk_queue_nonrot(q);
+ device->rotating = rotating;
+ if (rotating)
fs_devices->rotating = true;
+ if (!fs_devices->mixed)
+ fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
device->bdev = bdev;
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
@@ -2418,6 +2452,7 @@ static int btrfs_prepare_sprout(struct btrfs_fs_info *fs_info)
fs_devices->open_devices = 0;
fs_devices->missing_devices = 0;
fs_devices->rotating = false;
+ fs_devices->mixed = false;
list_add(&seed_devices->seed_list, &fs_devices->seed_list);
generate_random_uuid(fs_devices->fsid);
@@ -2522,6 +2557,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
int seeding_dev = 0;
int ret = 0;
bool locked = false;
+ bool rotating;
if (sb_rdonly(sb) && !fs_devices->seeding)
return -EROFS;
@@ -2621,8 +2657,12 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
- if (!blk_queue_nonrot(q))
+ rotating = !blk_queue_nonrot(q);
+ device->rotating = rotating;
+ if (rotating)
fs_devices->rotating = true;
+ if (!fs_devices->mixed)
+ fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
btrfs_set_super_total_bytes(fs_info->super_copy,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6e544317a377..594f1207281c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -147,6 +147,9 @@ struct btrfs_device {
/* I/O stats for raid1 mirror selection */
struct percpu_counter inflight;
atomic_t last_offset;
+
+ /* If the device is rotational */
+ bool rotating;
};
/*
@@ -274,6 +277,10 @@ struct btrfs_fs_devices {
* nonrot flag set
*/
bool rotating;
+ /* Set when we find or add both nonrot and rot disks in the
+ * filesystem
+ */
+ bool mixed;
struct btrfs_fs_info *fs_info;
/* sysfs kobjects */
--
2.30.0
On Tue, Feb 09, 2021 at 09:30:38PM +0100, Michal Rostecki wrote:
> From: Michal Rostecki <[email protected]>
>
> Add the btrfs_check_mixed() function which checks if the filesystem has
> the mixed type of devices (non-rotational and rotational). This
> information is going to be used in roundrobin raid1 read policy.
[...]
> @@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> }
>
> q = bdev_get_queue(bdev);
> - if (!blk_queue_nonrot(q))
> + rotating = !blk_queue_nonrot(q);
> + device->rotating = rotating;
> + if (rotating)
> fs_devices->rotating = true;
> + if (!fs_devices->mixed)
> + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
[...]
Since this is adding to a set, a faster way is:
if (fs_devices->rotating != rotating)
fs_devices->mixed = true;
The scan might be necessary on device removal, though.
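The suggested incremental check can be sketched in a userspace toy model (illustrative only, not btrfs code; names and the fixed-size array are made up). Because `mixed` is sticky once set, `!mixed` implies all present devices share one type equal to the set-wide rotating flag, so a single comparison on add suffices, while removal rescans the set:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the device set and its two set-wide flags. */
struct dev_set {
	bool rotating[8];
	size_t count;
	bool any_rot;	/* analogue of fs_devices->rotating */
	bool mixed;	/* analogue of fs_devices->mixed */
};

/* Device add: while !mixed, all devices share the type recorded in
 * any_rot, so one comparison detects a newly mixed set. */
static void dev_set_add(struct dev_set *s, bool rotating)
{
	if (s->count > 0 && s->any_rot != rotating)
		s->mixed = true;
	if (rotating)
		s->any_rot = true;
	s->rotating[s->count++] = rotating;
}

/* Device removal: the flags can go stale, so rescan the whole set. */
static void dev_set_remove(struct dev_set *s, size_t idx)
{
	bool any_nonrot = false;
	size_t i;

	s->rotating[idx] = s->rotating[--s->count];
	s->any_rot = false;
	for (i = 0; i < s->count; i++) {
		if (s->rotating[i])
			s->any_rot = true;
		else
			any_nonrot = true;
	}
	s->mixed = s->any_rot && any_nonrot;
}
```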
> - if (!blk_queue_nonrot(q))
> + rotating = !blk_queue_nonrot(q);
> + device->rotating = rotating;
> + if (rotating)
> fs_devices->rotating = true;
> + if (!fs_devices->mixed)
> + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
Duplication. Maybe pull all this into a function?
Best Regards,
Michał Mirosław
On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
[...]
> For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> (429MB/s) performance. Adding the penalty value 1 resulted in a
> performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> was making the performance even worse.
>
> For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> Not adding any value and increasing the value was making the performance
> worse.
>
> Adding penalty value to non-rotational disks was always decreasing the
> performance, which motivated setting it as 0 by default. For the purpose
> of testing, it's still configurable.
[...]
> + bdev = map->stripes[mirror_index].dev->bdev;
> + inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> + stripe_nr);
> + queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> +
> + return inflight < queue_depth;
[...]
> + last_mirror = this_cpu_read(*fs_info->last_mirror);
[...]
> + for (i = last_mirror; i < first + num_stripes; i++) {
> + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> + stripe_nr)) {
> + preferred_mirror = i;
> + goto out;
> + }
> + }
> +
> + for (i = first; i < last_mirror; i++) {
> + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> + stripe_nr)) {
> + preferred_mirror = i;
> + goto out;
> + }
> + }
> +
> + preferred_mirror = last_mirror;
> +
> +out:
> + this_cpu_write(*fs_info->last_mirror, preferred_mirror);
This looks like it effectively decreases queue depth for non-last
device. After all devices are filled to queue_depth-penalty, only
a single mirror will be selected for next reads (until a read on
some other one completes).
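Ignoring completions, the quoted selection loop can be modeled in a few lines of userspace C (a toy sketch, not the btrfs code; mirror count and depth are made up). It shows the first mirror filling to queue depth before the next is touched, and the fallback pinning a single mirror once all are full:

```c
#include <assert.h>

#define NUM_MIRRORS 2
#define QUEUE_DEPTH 4

static int inflight[NUM_MIRRORS];
static int last_mirror;

/* Simplified model of find_live_mirror_roundrobin(): scan from the
 * last used mirror, pick the first one below queue depth, fall back
 * to last_mirror when every mirror is full. */
static int pick_mirror(void)
{
	int i;

	for (i = last_mirror; i < NUM_MIRRORS; i++)
		if (inflight[i] < QUEUE_DEPTH)
			goto out;
	for (i = 0; i < last_mirror; i++)
		if (inflight[i] < QUEUE_DEPTH)
			goto out;
	i = last_mirror;
out:
	last_mirror = i;
	return i;
}
```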
Have you tried testing with much more jobs / non-sequential accesses?
Best Regards,
Michał Mirosław
On 10/02/2021 04:30, Michal Rostecki wrote:
> From: Michal Rostecki <[email protected]>
>
> This patch series adds a new raid1 read policy - roundrobin. For each
> request, it selects the mirror which has lower load than queue depth.
> Load is defined as the number of inflight requests + a penalty value
> (if the scheduled request is not local to the last processed request for
> a rotational disk).
>
> The series consists of preparatory changes which add necessary
> information to the btrfs_device struct and the change with the policy.
>
> This policy was tested with fio and compared with the default `pid`
> policy.
>
> The singlethreaded test has the following parameters:
>
> [global]
> name=btrfs-raid1-seqread
> filename=btrfs-raid1-seqread
> rw=read
> bs=64k
> direct=0
> numjobs=1
> time_based=0
>
> [file1]
> size=10G
> ioengine=libaio
>
> and shows the following results:
>
> - raid1c3 with 3 HDDs:
> 3 x Seagate Barracuda ST2000DM008 (2TB)
> * pid policy
> READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
> io=10.0GiB (10.7GB), run=47082-47082msec
> * roundrobin policy
> READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
> io=10.0GiB (10.7GB), run=25028-25028msec
> - raid1c3 with 2 HDDs and 1 SSD:
> 2 x Seagate Barracuda ST2000DM008 (2TB)
> 1 x Crucial CT256M550SSD1 (256GB)
> * pid policy (the worst case when only HDDs were chosen)
> READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
> io=10.0GiB (10.7GB), run=46577-46577msec
> * pid policy (the best case when SSD was used as well)
> READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
> io=10.0GiB (10.7GB), run=19954-19954msec
> * roundrobin (there are no noticeable differences when testing multiple
> times)
> READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
> io=10.0GiB (10.7GB), run=18933-18933msec
>
> The multithreaded test has the following parameters:
>
> [global]
> name=btrfs-raid1-seqread
> filename=btrfs-raid1-seqread
> rw=read
> bs=64k
> direct=0
> numjobs=8
> time_based=0
>
> [file1]
> size=10G
> ioengine=libaio
>
> and shows the following results:
>
> - raid1c3 with 3 HDDs:
> 3 x Seagate Barracuda ST2000DM008 (2TB)
> * pid policy
> READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
> io=80.0GiB (85.9GB), run=52210-52211msec
> * roundrobin
> READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
> io=80.0GiB (85.9GB), run=47269-47271msec
> - raid1c3 with 2 HDDs and 1 SSD:
> 2 x Seagate Barracuda ST2000DM008 (2TB)
> 1 x Crucial CT256M550SSD1 (256GB)
> * pid policy
> READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
> io=80.0GiB (85.9GB), run=44449-44450msec
> * roundrobin
> READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
> io=80.0GiB (85.9GB), run=32969-32970msec
>
Both of the above test cases are sequential. How about some random IO
workload?
Also, non-rotational devices have no seek time, so it is a good idea to
test with SSD + NVMe, all NVMe, or all SSD as well.
> To measure the performance of each policy and find optimal penalty
> values, I created scripts which are available here:
>
> https://gitlab.com/vadorovsky/btrfs-perf
> https://github.com/mrostecki/btrfs-perf
>
> Michal Rostecki (6):
> btrfs: Add inflight BIO request counter
> btrfs: Store the last device I/O offset
These patches look good. But only the round-robin policy requires
monitoring the inflight count and the last offset. Could you bring them
under if (policy == roundrobin)? Otherwise it is just a waste of CPU
cycles when the policy != roundrobin.
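The gating Anand suggests can be sketched in a userspace toy model (illustrative only; the names are made up, not the patch's actual helpers):

```c
#include <assert.h>

enum read_policy { POLICY_PID, POLICY_ROUNDROBIN };

static enum read_policy policy = POLICY_PID;
static int inflight;	/* stand-in for the per-device counter */

/* Bump the I/O stats only when a policy that consumes them is active,
 * so the default pid policy pays no accounting cost. */
static void account_submit(void)
{
	if (policy == POLICY_ROUNDROBIN)
		inflight++;
}

static void account_complete(void)
{
	if (policy == POLICY_ROUNDROBIN)
		inflight--;
}
```

One caveat with this scheme: switching the policy while requests are in flight would leave the counter skewed, so completions would need to be accounted unconditionally, or the counters reset on a policy switch.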
> btrfs: Add stripe_physical function
> btrfs: Check if the filesystem has mixed types of devices
Thanks, Anand
> btrfs: sysfs: Add directory for read policies
> btrfs: Add roundrobin raid1 read policy
>
> fs/btrfs/ctree.h | 3 +
> fs/btrfs/disk-io.c | 3 +
> fs/btrfs/sysfs.c | 144 ++++++++++++++++++++++++++----
> fs/btrfs/volumes.c | 218 +++++++++++++++++++++++++++++++++++++++++++--
> fs/btrfs/volumes.h | 22 +++++
> 5 files changed, 366 insertions(+), 24 deletions(-)
>
On 10/02/2021 04:30, Michal Rostecki wrote:
> From: Michal Rostecki <[email protected]>
>
> Add a new raid1 read policy `roundrobin`. For each read request, it
> selects the mirror which has lower load than queue depth and it starts
> iterating from the last used mirror (by the current CPU). Load is
> defined as the number of inflight requests + a potential penalty value.
>
> The policy can be enabled through sysfs:
>
> # echo roundrobin > /sys/fs/btrfs/[fsid]/read_policies/policy
>
> This policy was tested with fio and compared with the default `pid`
> policy.
>
> The singlethreaded test has the following parameters:
>
> [global]
> name=btrfs-raid1-seqread
> filename=btrfs-raid1-seqread
> rw=read
> bs=64k
> direct=0
> numjobs=1
> time_based=0
>
> [file1]
> size=10G
> ioengine=libaio
>
> and shows the following results:
>
> - raid1c3 with 3 HDDs:
> 3 x Seagate Barracuda ST2000DM008 (2TB)
> * pid policy
> READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
> io=10.0GiB (10.7GB), run=47082-47082msec
> * roundrobin policy
> READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
> io=10.0GiB (10.7GB), run=25028-25028msec
> - raid1c3 with 2 HDDs and 1 SSD:
> 2 x Seagate Barracuda ST2000DM008 (2TB)
> 1 x Crucial CT256M550SSD1 (256GB)
> * pid policy (the worst case when only HDDs were chosen)
> READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
> io=10.0GiB (10.7GB), run=46577-46577msec
> * pid policy (the best case when SSD was used as well)
> READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
> io=10.0GiB (10.7GB), run=19954-19954msec
> * roundrobin (there are no noticeable differences when testing multiple
> times)
> READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
> io=10.0GiB (10.7GB), run=18933-18933msec
>
> The multithreaded test has the following parameters:
>
> [global]
> name=btrfs-raid1-seqread
> filename=btrfs-raid1-seqread
> rw=read
> bs=64k
> direct=0
> numjobs=8
> time_based=0
>
> [file1]
> size=10G
> ioengine=libaio
>
> and shows the following results:
>
> - raid1c3 with 3 HDDs:
> 3 x Seagate Barracuda ST2000DM008 (2TB)
> * pid policy
> READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
> io=80.0GiB (85.9GB), run=52210-52211msec
> * roundrobin
> READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
> io=80.0GiB (85.9GB), run=47269-47271msec
> - raid1c3 with 2 HDDs and 1 SSD:
> 2 x Seagate Barracuda ST2000DM008 (2TB)
> 1 x Crucial CT256M550SSD1 (256GB)
> * pid policy
> READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
> io=80.0GiB (85.9GB), run=44449-44450msec
> * roundrobin
> READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
> io=80.0GiB (85.9GB), run=32969-32970msec
>
> The penalty value is an additional value added to the number of inflight
> requests when a scheduled request is non-local (which means it would
> start from the different physical location than the physical location of
> the last request processed by the given device). By default, it's
> applied only in filesystems which have mixed types of devices
> (non-rotational and rotational), but it can be configured to be applied
> without that condition.
>
> The configuration is done through sysfs:
>
> - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only
>
> where 1 (the default) value means applying penalty only in mixed arrays,
> 0 means applying it unconditionally.
>
> The exact penalty value is defined separately for non-rotational and
> rotational devices. By default, it's 0 for non-rotational devices and 1
> for rotational devices. Both values are configurable through sysfs:
>
> - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
> - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc
>
> To sum it up - the default case is applying the penalty under the
> following conditions:
>
> - the raid1 array consists of mixed types of devices
> - the scheduled request is going to be non-local for the given disk
> - the device is rotational
>
> That default case is based on a slight preference towards non-rotational
> disks in mixed arrays and has proven to give the best performance in
> tested arrays.
>
> For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> (429MB/s) performance. Adding the penalty value 1 resulted in a
> performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> was making the performance even worse.
>
> For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> Not adding any value and increasing the value was making the performance
> worse.
>
> Adding penalty value to non-rotational disks was always decreasing the
> performance, which motivated setting it as 0 by default. For the purpose
> of testing, it's still configurable.
>
> To measure the performance of each policy and find optimal penalty
> values, I created scripts which are available here:
>
So in summary:

rotational + non-rotational:      penalty = 1
all-rotational, homogeneous:      penalty = 0
all-non-rotational, homogeneous:  penalty = 0

I can't find anything non-deterministic in your findings above, so it
is not very clear to me if we need the configurable parameters here.
It would be better to also have random workloads in the above three
categories of configs.
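The default behaviour summarized above can be sketched as a userspace toy version of mirror_load() (illustrative only; the struct and constants mirror the patch's defaults but are not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Default penalty values from the patch */
#define ROT_NONLOCAL_INC	1
#define NONROT_NONLOCAL_INC	0

struct mirror {
	int inflight;		/* requests already queued to the device */
	long long last_offset;	/* physical offset of the last request */
	bool rotating;
};

/* Toy mirror_load(): the inflight count plus a penalty when the new
 * request would not continue at the device's last offset, applied (by
 * default) only when the array mixes device types. */
static int mirror_load(const struct mirror *m, long long physical,
		       bool fs_mixed)
{
	int load = m->inflight;

	if (fs_mixed && m->last_offset != physical)
		load += m->rotating ? ROT_NONLOCAL_INC : NONROT_NONLOCAL_INC;
	return load;
}
```

With these defaults, a non-local request in a mixed array makes the HDD look one request busier than the SSD at equal queue occupancy, which is the slight non-rotational preference described in the commit message.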
Apart from the above three configs, there is also all-non-rotational
but heterogeneous, for example an SSD and an NVMe drive together (both
non-rotational), and all-rotational but heterogeneous, for example
rotational disks with different speeds.
The inflight calculation is local to btrfs. If a device is busy due to
external factors, the policy would not switch to the better performing
device.
Thanks, Anand
> https://gitlab.com/vadorovsky/btrfs-perf
> https://github.com/mrostecki/btrfs-perf
>
> Signed-off-by: Michal Rostecki <[email protected]>
> ---
> fs/btrfs/ctree.h | 3 +
> fs/btrfs/disk-io.c | 3 +
> fs/btrfs/sysfs.c | 93 +++++++++++++++++++++++++++++-
> fs/btrfs/volumes.c | 137 ++++++++++++++++++++++++++++++++++++++++++++-
> fs/btrfs/volumes.h | 10 ++++
> 5 files changed, 242 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a9b0521d9e89..6ff0a18fd219 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -976,6 +976,9 @@ struct btrfs_fs_info {
> /* Max size to emit ZONE_APPEND write command */
> u64 max_zone_append_size;
>
> + /* Last mirror picked in round-robin selection */
> + int __percpu *last_mirror;
> +
> #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> spinlock_t ref_verify_lock;
> struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 71fab77873a5..937fcadbdd2f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1547,6 +1547,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
> btrfs_extent_buffer_leak_debug_check(fs_info);
> kfree(fs_info->super_copy);
> kfree(fs_info->super_for_commit);
> + free_percpu(fs_info->last_mirror);
> kvfree(fs_info);
> }
>
> @@ -2857,6 +2858,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> fs_info->swapfile_pins = RB_ROOT;
>
> fs_info->send_in_progress = 0;
> +
> + fs_info->last_mirror = __alloc_percpu(sizeof(int), __alignof__(int));
> }
>
> static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index a8f528eb4e50..b9a6d38843ef 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -928,7 +928,7 @@ static bool strmatch(const char *buffer, const char *string)
> return false;
> }
>
> -static const char * const btrfs_read_policy_name[] = { "pid" };
> +static const char * const btrfs_read_policy_name[] = { "pid", "roundrobin" };
>
> static ssize_t btrfs_read_policy_show(struct kobject *kobj,
> struct kobj_attribute *a, char *buf)
> @@ -976,8 +976,99 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
> }
> BTRFS_ATTR_RW(read_policies, policy, btrfs_read_policy_show, btrfs_read_policy_store);
>
> +static ssize_t btrfs_roundrobin_nonlocal_inc_mixed_only_show(
> + struct kobject *kobj, struct kobj_attribute *a, char *buf)
> +{
> + struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
> +
> + return scnprintf(buf, PAGE_SIZE, "%d\n",
> + READ_ONCE(fs_devices->roundrobin_nonlocal_inc_mixed_only));
> +}
> +
> +static ssize_t btrfs_roundrobin_nonlocal_inc_mixed_only_store(
> + struct kobject *kobj, struct kobj_attribute *a, const char *buf,
> + size_t len)
> +{
> + struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
> + bool val;
> + int ret;
> +
> + ret = kstrtobool(buf, &val);
> + if (ret)
> + return -EINVAL;
> +
> + WRITE_ONCE(fs_devices->roundrobin_nonlocal_inc_mixed_only, val);
> + return len;
> +}
> +BTRFS_ATTR_RW(read_policies, roundrobin_nonlocal_inc_mixed_only,
> + btrfs_roundrobin_nonlocal_inc_mixed_only_show,
> + btrfs_roundrobin_nonlocal_inc_mixed_only_store);
> +
> +static ssize_t btrfs_roundrobin_nonrot_nonlocal_inc_show(struct kobject *kobj,
> + struct kobj_attribute *a,
> + char *buf)
> +{
> + struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
> +
> + return scnprintf(buf, PAGE_SIZE, "%d\n",
> + READ_ONCE(fs_devices->roundrobin_nonrot_nonlocal_inc));
> +}
> +
> +static ssize_t btrfs_roundrobin_nonrot_nonlocal_inc_store(struct kobject *kobj,
> + struct kobj_attribute *a,
> + const char *buf,
> + size_t len)
> +{
> + struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
> + u32 val;
> + int ret;
> +
> + ret = kstrtou32(buf, 10, &val);
> + if (ret)
> + return -EINVAL;
> +
> + WRITE_ONCE(fs_devices->roundrobin_nonrot_nonlocal_inc, val);
> + return len;
> +}
> +BTRFS_ATTR_RW(read_policies, roundrobin_nonrot_nonlocal_inc,
> + btrfs_roundrobin_nonrot_nonlocal_inc_show,
> + btrfs_roundrobin_nonrot_nonlocal_inc_store);
> +
> +static ssize_t btrfs_roundrobin_rot_nonlocal_inc_show(struct kobject *kobj,
> + struct kobj_attribute *a,
> + char *buf)
> +{
> + struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
> +
> + return scnprintf(buf, PAGE_SIZE, "%d\n",
> + READ_ONCE(fs_devices->roundrobin_rot_nonlocal_inc));
> +}
> +
> +static ssize_t btrfs_roundrobin_rot_nonlocal_inc_store(struct kobject *kobj,
> + struct kobj_attribute *a,
> + const char *buf,
> + size_t len)
> +{
> + struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj->parent);
> + u32 val;
> + int ret;
> +
> + ret = kstrtou32(buf, 10, &val);
> + if (ret)
> + return -EINVAL;
> +
> + WRITE_ONCE(fs_devices->roundrobin_rot_nonlocal_inc, val);
> + return len;
> +}
> +BTRFS_ATTR_RW(read_policies, roundrobin_rot_nonlocal_inc,
> + btrfs_roundrobin_rot_nonlocal_inc_show,
> + btrfs_roundrobin_rot_nonlocal_inc_store);
> +
> static const struct attribute *read_policies_attrs[] = {
> BTRFS_ATTR_PTR(read_policies, policy),
> + BTRFS_ATTR_PTR(read_policies, roundrobin_nonlocal_inc_mixed_only),
> + BTRFS_ATTR_PTR(read_policies, roundrobin_nonrot_nonlocal_inc),
> + BTRFS_ATTR_PTR(read_policies, roundrobin_rot_nonlocal_inc),
> NULL,
> };
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1ad30a595722..c6dd393190f6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1265,6 +1265,11 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
> fs_devices->total_rw_bytes = 0;
> fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_REGULAR;
> fs_devices->read_policy = BTRFS_READ_POLICY_PID;
> + fs_devices->roundrobin_nonlocal_inc_mixed_only = true;
> + fs_devices->roundrobin_nonrot_nonlocal_inc =
> + BTRFS_DEFAULT_ROUNDROBIN_NONROT_NONLOCAL_INC;
> + fs_devices->roundrobin_rot_nonlocal_inc =
> + BTRFS_DEFAULT_ROUNDROBIN_ROT_NONLOCAL_INC;
>
> return 0;
> }
> @@ -5555,8 +5560,125 @@ static u64 stripe_physical(struct map_lookup *map, u32 stripe_index,
> stripe_nr * map->stripe_len;
> }
>
> +/*
> + * Calculates the load of the given mirror. Load is defined as the number of
> + * inflight requests + potential penalty value.
> + *
> + * @fs_info: the filesystem
> + * @map: mapping containing the logical extent
> + * @mirror_index: number of mirror to check
> + * @stripe_offset: offset of the block in its stripe
> + * @stripe_nr: index of the stripe where the block falls in
> + */
> +static int mirror_load(struct btrfs_fs_info *fs_info, struct map_lookup *map,
> + int mirror_index, u64 stripe_offset, u64 stripe_nr)
> +{
> + struct btrfs_fs_devices *fs_devices;
> + struct btrfs_device *dev;
> + int last_offset;
> + u64 physical;
> + int load;
> +
> + dev = map->stripes[mirror_index].dev;
> + load = percpu_counter_sum(&dev->inflight);
> + last_offset = atomic_read(&dev->last_offset);
> + physical = stripe_physical(map, mirror_index, stripe_offset, stripe_nr);
> +
> + fs_devices = fs_info->fs_devices;
> +
> + /*
> + * If the filesystem has mixed type of devices (or we enable adding a
> + * penalty value regardless) and the request is non-local, add a
> + * penalty value.
> + */
> + if ((!fs_devices->roundrobin_nonlocal_inc_mixed_only ||
> + fs_devices->mixed) && last_offset != physical) {
> + if (dev->rotating)
> + return load + fs_devices->roundrobin_rot_nonlocal_inc;
> + return load + fs_devices->roundrobin_nonrot_nonlocal_inc;
> + }
> +
> + return load;
> +}
> +
> +/*
> + * Checks if the given mirror can process more requests.
> + *
> + * @fs_info: the filesystem
> + * @map: mapping containing the logical extent
> + * @mirror_index: index of the mirror to check
> + * @stripe_offset: offset of the block in its stripe
> + * @stripe_nr: index of the stripe where the block falls in
> + *
> + * Returns true if more requests can be processed, otherwise returns false.
> + */
> +static bool mirror_queue_not_filled(struct btrfs_fs_info *fs_info,
> + struct map_lookup *map, int mirror_index,
> + u64 stripe_offset, u64 stripe_nr)
> +{
> + struct block_device *bdev;
> + unsigned int queue_depth;
> + int inflight;
> +
> + bdev = map->stripes[mirror_index].dev->bdev;
> + inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> + stripe_nr);
> + queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> +
> + return inflight < queue_depth;
> +}
> +
> +/*
> + * Find a mirror using the round-robin technique which has lower load than
> + * queue depth. Load is defined as the number of inflight requests + potential
> + * penalty value.
> + *
> + * @fs_info: the filesystem
> + * @map: mapping containing the logical extent
> + * @first: index of the first device in the stripes array
> + * @num_stripes: number of stripes in the stripes array
> + * @stripe_offset: offset of the block in its stripe
> + * @stripe_nr: index of the stripe where the block falls in
> + *
> + * Returns the index of selected mirror.
> + */
> +static int find_live_mirror_roundrobin(struct btrfs_fs_info *fs_info,
> + struct map_lookup *map, int first,
> + int num_stripes, u64 stripe_offset,
> + u64 stripe_nr)
> +{
> + int preferred_mirror;
> + int last_mirror;
> + int i;
> +
> + last_mirror = this_cpu_read(*fs_info->last_mirror);
> +
> + for (i = last_mirror; i < first + num_stripes; i++) {
> + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> + stripe_nr)) {
> + preferred_mirror = i;
> + goto out;
> + }
> + }
> +
> + for (i = first; i < last_mirror; i++) {
> + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> + stripe_nr)) {
> + preferred_mirror = i;
> + goto out;
> + }
> + }
> +
> + preferred_mirror = last_mirror;
> +
> +out:
> + this_cpu_write(*fs_info->last_mirror, preferred_mirror);
> + return preferred_mirror;
> +}
> +
> static int find_live_mirror(struct btrfs_fs_info *fs_info,
> struct map_lookup *map, int first,
> + u64 stripe_offset, u64 stripe_nr,
> int dev_replace_is_ongoing)
> {
> int i;
> @@ -5584,6 +5706,11 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
> case BTRFS_READ_POLICY_PID:
> preferred_mirror = first + (current->pid % num_stripes);
> break;
> + case BTRFS_READ_POLICY_ROUNDROBIN:
> + preferred_mirror = find_live_mirror_roundrobin(
> + fs_info, map, first, num_stripes, stripe_offset,
> + stripe_nr);
> + break;
> }
>
> if (dev_replace_is_ongoing &&
> @@ -6178,7 +6305,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
> stripe_index = mirror_num - 1;
> else {
> stripe_index = find_live_mirror(fs_info, map, 0,
> - dev_replace_is_ongoing);
> + stripe_offset,
> + stripe_nr,
> + dev_replace_is_ongoing);
> mirror_num = stripe_index + 1;
> }
>
> @@ -6204,8 +6333,10 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
> else {
> int old_stripe_index = stripe_index;
> stripe_index = find_live_mirror(fs_info, map,
> - stripe_index,
> - dev_replace_is_ongoing);
> + stripe_index,
> + stripe_offset,
> + stripe_nr,
> + dev_replace_is_ongoing);
> mirror_num = stripe_index - old_stripe_index + 1;
> }
>
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ee050fd48042..47ca47b60ea9 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -230,9 +230,15 @@ enum btrfs_chunk_allocation_policy {
> enum btrfs_read_policy {
> /* Use process PID to choose the stripe */
> BTRFS_READ_POLICY_PID,
> + /* Round robin */
> + BTRFS_READ_POLICY_ROUNDROBIN,
> BTRFS_NR_READ_POLICY,
> };
>
> +/* Default raid1 policies config */
> +#define BTRFS_DEFAULT_ROUNDROBIN_NONROT_NONLOCAL_INC 0
> +#define BTRFS_DEFAULT_ROUNDROBIN_ROT_NONLOCAL_INC 1
> +
> struct btrfs_fs_devices {
> u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
> u8 metadata_uuid[BTRFS_FSID_SIZE];
> @@ -294,6 +300,10 @@ struct btrfs_fs_devices {
>
> /* Policy used to read the mirrored stripes */
> enum btrfs_read_policy read_policy;
> + /* Policies config */
> + bool roundrobin_nonlocal_inc_mixed_only;
> + u32 roundrobin_nonrot_nonlocal_inc;
> + u32 roundrobin_rot_nonlocal_inc;
> };
>
> #define BTRFS_BIO_INLINE_CSUM_SIZE 64
>
On Tue, Feb 9, 2021 at 9:32 PM Michal Rostecki <[email protected]> wrote:
>
> From: Michal Rostecki <[email protected]>
>
> Add the btrfs_check_mixed() function which checks if the filesystem has
> the mixed type of devices (non-rotational and rotational). This
> information is going to be used in roundrobin raid1 read policy.
>
> Signed-off-by: Michal Rostecki <[email protected]>
> ---
> fs/btrfs/volumes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> fs/btrfs/volumes.h | 7 +++++++
> 2 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1ac364a2f105..1ad30a595722 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -617,6 +617,35 @@ static int btrfs_free_stale_devices(const char *path,
> return ret;
> }
>
> +/*
> + * Checks if after adding the new device the filesystem is going to have mixed
> + * types of devices (non-rotational and rotational).
> + *
> + * @fs_devices: list of devices
> + * @new_device_rotating: if the new device is rotational
> + *
> + * Returns true if there are mixed types of devices, otherwise returns false.
> + */
> +static bool btrfs_check_mixed(struct btrfs_fs_devices *fs_devices,
> + bool new_device_rotating)
> +{
> + struct btrfs_device *device, *prev_device;
> +
> + list_for_each_entry(device, &fs_devices->devices, dev_list) {
> + if (prev_device == NULL &&
Hum, prev_device is not initialized when we enter the first iteration
of the loop.
> + device->rotating != new_device_rotating)
> + return true;
> + if (prev_device != NULL &&
> + (device->rotating != prev_device->rotating ||
Here it's more dangerous, dereferencing an uninitialized pointer can
result in a crash.
With this fixed, it would be better to redo the benchmarks when using
mixed device types.
Thanks.
> + device->rotating != new_device_rotating))
> + return true;
> +
> + prev_device = device;
> + }
> +
> + return false;
> +}
> +
> /*
> * This is only used on mount, and we are protected from competing things
> * messing with our fs_devices by the uuid_mutex, thus we do not need the
> @@ -629,6 +658,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> struct request_queue *q;
> struct block_device *bdev;
> struct btrfs_super_block *disk_super;
> + bool rotating;
> u64 devid;
> int ret;
>
> @@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> }
>
> q = bdev_get_queue(bdev);
> - if (!blk_queue_nonrot(q))
> + rotating = !blk_queue_nonrot(q);
> + device->rotating = rotating;
> + if (rotating)
> fs_devices->rotating = true;
> + if (!fs_devices->mixed)
> + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
>
> device->bdev = bdev;
> clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
> @@ -2418,6 +2452,7 @@ static int btrfs_prepare_sprout(struct btrfs_fs_info *fs_info)
> fs_devices->open_devices = 0;
> fs_devices->missing_devices = 0;
> fs_devices->rotating = false;
> + fs_devices->mixed = false;
> list_add(&seed_devices->seed_list, &fs_devices->seed_list);
>
> generate_random_uuid(fs_devices->fsid);
> @@ -2522,6 +2557,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> int seeding_dev = 0;
> int ret = 0;
> bool locked = false;
> + bool rotating;
>
> if (sb_rdonly(sb) && !fs_devices->seeding)
> return -EROFS;
> @@ -2621,8 +2657,12 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>
> atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
>
> - if (!blk_queue_nonrot(q))
> + rotating = !blk_queue_nonrot(q);
> + device->rotating = rotating;
> + if (rotating)
> fs_devices->rotating = true;
> + if (!fs_devices->mixed)
> + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
>
> orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
> btrfs_set_super_total_bytes(fs_info->super_copy,
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 6e544317a377..594f1207281c 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -147,6 +147,9 @@ struct btrfs_device {
> /* I/O stats for raid1 mirror selection */
> struct percpu_counter inflight;
> atomic_t last_offset;
> +
> + /* If the device is rotational */
> + bool rotating;
> };
>
> /*
> @@ -274,6 +277,10 @@ struct btrfs_fs_devices {
> * nonrot flag set
> */
> bool rotating;
> + /* Set when we find or add both nonrot and rot disks in the
> + * filesystem
> + */
> + bool mixed;
>
> struct btrfs_fs_info *fs_info;
> /* sysfs kobjects */
> --
> 2.30.0
>
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”
On Wed, Feb 10, 2021 at 02:52:01PM +0800, Anand Jain wrote:
> On 10/02/2021 04:30, Michal Rostecki wrote:
> > From: Michal Rostecki <[email protected]>
> >
> > This patch series adds a new raid1 read policy - roundrobin. For each
> > request, it selects the mirror which has lower load than queue depth.
> > Load is defined as the number of inflight requests + a penalty value
> > (if the scheduled request is not local to the last processed request for
> > a rotational disk).
> >
> > The series consists of preparatory changes which add necessary
> > information to the btrfs_device struct and the change with the policy.
> >
> > This policy was tested with fio and compared with the default `pid`
> > policy.
> >
> > The singlethreaded test has the following parameters:
> >
> > [global]
> > name=btrfs-raid1-seqread
> > filename=btrfs-raid1-seqread
> > rw=read
> > bs=64k
> > direct=0
> > numjobs=1
> > time_based=0
> >
> > [file1]
> > size=10G
> > ioengine=libaio
> >
> > and shows the following results:
> >
> > - raid1c3 with 3 HDDs:
> > 3 x Segate Barracuda ST2000DM008 (2TB)
> > * pid policy
> > READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
> > io=10.0GiB (10.7GB), run=47082-47082msec
> > * roundrobin policy
> > READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
> > io=10.0GiB (10.7GB), run=25028-25028msec
>
>
> > - raid1c3 with 2 HDDs and 1 SSD:
> > 2 x Seagate Barracuda ST2000DM008 (2TB)
> > 1 x Crucial CT256M550SSD1 (256GB)
> > * pid policy (the worst case when only HDDs were chosen)
> > READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
> > io=10.0GiB (10.7GB), run=46577-46577msec
> > * pid policy (the best case when SSD was used as well)
> > READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
> > io=10.0GiB (10.7GB), run=19954-19954msec
> > * roundrobin (there are no noticeable differences when testing multiple
> > times)
> > READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
> > io=10.0GiB (10.7GB), run=18933-18933msec
> >
> > The multithreaded test has the following parameters:
> >
> > [global]
> > name=btrfs-raid1-seqread
> > filename=btrfs-raid1-seqread
> > rw=read
> > bs=64k
> > direct=0
> > numjobs=8
> > time_based=0
> >
> > [file1]
> > size=10G
> > ioengine=libaio
> >
> > and shows the following results:
> >
> > - raid1c3 with 3 HDDs:
> > 3 x Seagate Barracuda ST2000DM008 (2TB)
> > * pid policy
> > READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
> > io=80.0GiB (85.9GB), run=52210-52211msec
> > * roundrobin
> > READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
> > io=80.0GiB (85.9GB), run=47269-47271msec
> > - raid1c3 with 2 HDDs and 1 SSD:
> > 2 x Seagate Barracuda ST2000DM008 (2TB)
> > 1 x Crucial CT256M550SSD1 (256GB)
> > * pid policy
> > READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
> > io=80.0GiB (85.9GB), run=44449-44450msec
> > * roundrobin
> > READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
> > io=80.0GiB (85.9GB), run=32969-32970msec
> >
>
> Both of the above test cases are sequential. How about some random IO
> workload?
>
> Also, the seek time for non-rotational devices does not exist. So it is
> a good idea to test with ssd + nvme and all nvme or all ssd.
>
Good idea. I will test random I/O and will try to test all-nvme /
all-ssd and mixed nonrot.
> > To measure the performance of each policy and find optimal penalty
> > values, I created scripts which are available here:
> >
> > https://gitlab.com/vadorovsky/btrfs-perf
> > https://github.com/mrostecki/btrfs-perf
> >
> > Michal Rostecki (6):
>
>
> > btrfs: Add inflight BIO request counter
> > btrfs: Store the last device I/O offset
>
> These patches look good. But as only round-robin policy requires
> to monitor the inflight and last-offset. Could you bring them under
> if policy=roundrobin? Otherwise, it is just a waste of CPU cycles
> if the policy != roundrobin.
>
If I bring those stats under if policy=roundrobin, they are going to be
inaccurate if someone switches policies on the running system, after
doing any I/O in that filesystem.
I'm open to suggestions how can I make those stats as lightweight as
possible. Unfortunately, I don't think I can store the last physical
location without atomic_t.
The BIO percpu counter is probably the least to be worried about, though
I could maybe get rid of it entirely in favor of using part_stat_read().
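The load metric those stats feed into can be modeled in userspace roughly as follows (the names, the struct, and the fixed penalty values are illustrative; in the kernel the inflight count is a percpu_counter and the last offset an atomic_t):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of one mirror's I/O state.  In the kernel these would be a
 * percpu_counter (inflight) and an atomic_t (last_offset). */
struct model_device {
	uint64_t inflight;
	uint64_t last_offset;
	bool rotating;
};

/* Load metric from the cover letter: inflight plus a penalty when the
 * scheduled read does not start where the last one left off.  Default
 * penalty values: 1 for rotational, 0 for non-rotational devices. */
static uint64_t mirror_load(const struct model_device *dev, uint64_t physical)
{
	uint64_t penalty = dev->rotating ? 1 : 0;

	if (physical == dev->last_offset)
		return dev->inflight;   /* local (sequential) read */
	return dev->inflight + penalty; /* non-local: seek expected */
}
```

A sequential read on an HDD is charged only its inflight count, while a non-local one pays the extra penalty; an SSD pays no penalty either way.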
> > btrfs: Add stripe_physical function
> > btrfs: Check if the filesystem is has mixed type of devices
>
> Thanks, Anand
>
> > btrfs: sysfs: Add directory for read policies
> > btrfs: Add roundrobin raid1 read policy
> >
> > fs/btrfs/ctree.h | 3 +
> > fs/btrfs/disk-io.c | 3 +
> > fs/btrfs/sysfs.c | 144 ++++++++++++++++++++++++++----
> > fs/btrfs/volumes.c | 218 +++++++++++++++++++++++++++++++++++++++++++--
> > fs/btrfs/volumes.h | 22 +++++
> > 5 files changed, 366 insertions(+), 24 deletions(-)
> >
>
Thanks for review,
Michal
On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
> [...]
> > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > was making the performance even worse.
> >
> > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > Not adding any value and increasing the value was making the performance
> > worse.
> >
> > Adding penalty value to non-rotational disks was always decreasing the
> > performance, which motivated setting it as 0 by default. For the purpose
> > of testing, it's still configurable.
> [...]
> > + bdev = map->stripes[mirror_index].dev->bdev;
> > + inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> > + stripe_nr);
> > + queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> > +
> > + return inflight < queue_depth;
> [...]
> > + last_mirror = this_cpu_read(*fs_info->last_mirror);
> [...]
> > + for (i = last_mirror; i < first + num_stripes; i++) {
> > + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > + stripe_nr)) {
> > + preferred_mirror = i;
> > + goto out;
> > + }
> > + }
> > +
> > + for (i = first; i < last_mirror; i++) {
> > + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > + stripe_nr)) {
> > + preferred_mirror = i;
> > + goto out;
> > + }
> > + }
> > +
> > + preferred_mirror = last_mirror;
> > +
> > +out:
> > + this_cpu_write(*fs_info->last_mirror, preferred_mirror);
>
> This looks like it effectively decreases queue depth for non-last
> device. After all devices are filled to queue_depth-penalty, only
> a single mirror will be selected for next reads (until a read on
> some other one completes).
>
Good point. And if all devices are going to be filled for longer time,
this function will keep selecting the last one. Maybe I should select
last+1 in that case. Would that address your concern or did you have any
other solution in mind?
Thanks for pointing that out.
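The "select last+1" idea floated above can be sketched as a userspace model of the quoted selection loop (indices are 0-based here and all names are illustrative, not the kernel code):

```c
/* Sketch of the mirror selection loop quoted above, with a "pick
 * last+1" fallback when every mirror's queue is full, so that one
 * device is not always preferred once all queues fill up. */
static int pick_mirror(const int *inflight, const int *queue_depth,
		       int num_stripes, int last_mirror)
{
	int i;

	/* Scan forward from the last used mirror... */
	for (i = last_mirror; i < num_stripes; i++)
		if (inflight[i] < queue_depth[i])
			return i;
	/* ...then wrap around to the beginning. */
	for (i = 0; i < last_mirror; i++)
		if (inflight[i] < queue_depth[i])
			return i;
	/* All queues filled: keep rotating instead of sticking to
	 * last_mirror. */
	return (last_mirror + 1) % num_stripes;
}
```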
> Have you tried testing with much more jobs / non-sequential accesses?
>
I didn't try with non-sequential accesses. Will do that before
respinning v2.
> Best Regards,
> Michał Mirosław
Regards,
Michal
On Wed, Feb 10, 2021 at 05:08:05AM +0100, Michał Mirosław wrote:
> On Tue, Feb 09, 2021 at 09:30:38PM +0100, Michal Rostecki wrote:
> > From: Michal Rostecki <[email protected]>
> >
> > Add the btrfs_check_mixed() function which checks if the filesystem has
> > the mixed type of devices (non-rotational and rotational). This
> > information is going to be used in roundrobin raid1 read policy.
> [...]
> > @@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> > }
> >
> > q = bdev_get_queue(bdev);
> > - if (!blk_queue_nonrot(q))
> > + rotating = !blk_queue_nonrot(q);
> > + device->rotating = rotating;
> > + if (rotating)
> > fs_devices->rotating = true;
> > + if (!fs_devices->mixed)
> > + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
> [...]
>
> Since this is adding to a set, a faster way is:
>
> if (fs_devices->rotating != rotating)
> fs_devices->mixed = true;
>
Good idea.
> The scan might be necessary on device removal, though.
>
And good point. I didn't address the case of device removal at all.
> > - if (!blk_queue_nonrot(q))
> > + rotating = !blk_queue_nonrot(q);
> > + device->rotating = rotating;
> > + if (rotating)
> > fs_devices->rotating = true;
> > + if (!fs_devices->mixed)
> > + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
>
> Duplication. Maybe pull all this into a function?
>
Will do that as well.
> Best Regards,
> Michał Mirosław
Regards,
Michal
On Wed, Feb 10, 2021 at 10:09:10AM +0000, Filipe Manana wrote:
> On Tue, Feb 9, 2021 at 9:32 PM Michal Rostecki <[email protected]> wrote:
> >
> > From: Michal Rostecki <[email protected]>
> >
> > Add the btrfs_check_mixed() function which checks if the filesystem has
> > the mixed type of devices (non-rotational and rotational). This
> > information is going to be used in roundrobin raid1 read policy.
> >
> > Signed-off-by: Michal Rostecki <[email protected]>
> > ---
> > fs/btrfs/volumes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> > fs/btrfs/volumes.h | 7 +++++++
> > 2 files changed, 49 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 1ac364a2f105..1ad30a595722 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -617,6 +617,35 @@ static int btrfs_free_stale_devices(const char *path,
> > return ret;
> > }
> >
> > +/*
> > + * Checks if after adding the new device the filesystem is going to have mixed
> > + * types of devices (non-rotational and rotational).
> > + *
> > + * @fs_devices: list of devices
> > + * @new_device_rotating: if the new device is rotational
> > + *
> > + * Returns true if there are mixed types of devices, otherwise returns false.
> > + */
> > +static bool btrfs_check_mixed(struct btrfs_fs_devices *fs_devices,
> > + bool new_device_rotating)
> > +{
> > + struct btrfs_device *device, *prev_device;
> > +
> > + list_for_each_entry(device, &fs_devices->devices, dev_list) {
> > + if (prev_device == NULL &&
>
> Hum, prev_device is not initialized when we enter the first iteration
> of the loop.
>
> > + device->rotating != new_device_rotating)
> > + return true;
> > + if (prev_device != NULL &&
> > + (device->rotating != prev_device->rotating ||
>
> Here it's more dangerous, dereferencing an uninitialized pointer can
> result in a crash.
>
> With this fixed, it would be better to redo the benchmarks when using
> mixed device types.
>
> Thanks.
>
Thanks for pointing that out. Will fix and redo benchmarks for v2.
> > + device->rotating != new_device_rotating))
> > + return true;
> > +
> > + prev_device = device;
> > + }
> > +
> > + return false;
> > +}
> > +
> > /*
> > * This is only used on mount, and we are protected from competing things
> > * messing with our fs_devices by the uuid_mutex, thus we do not need the
> > @@ -629,6 +658,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> > struct request_queue *q;
> > struct block_device *bdev;
> > struct btrfs_super_block *disk_super;
> > + bool rotating;
> > u64 devid;
> > int ret;
> >
> > @@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> > }
> >
> > q = bdev_get_queue(bdev);
> > - if (!blk_queue_nonrot(q))
> > + rotating = !blk_queue_nonrot(q);
> > + device->rotating = rotating;
> > + if (rotating)
> > fs_devices->rotating = true;
> > + if (!fs_devices->mixed)
> > + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
> >
> > device->bdev = bdev;
> > clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
> > @@ -2418,6 +2452,7 @@ static int btrfs_prepare_sprout(struct btrfs_fs_info *fs_info)
> > fs_devices->open_devices = 0;
> > fs_devices->missing_devices = 0;
> > fs_devices->rotating = false;
> > + fs_devices->mixed = false;
> > list_add(&seed_devices->seed_list, &fs_devices->seed_list);
> >
> > generate_random_uuid(fs_devices->fsid);
> > @@ -2522,6 +2557,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> > int seeding_dev = 0;
> > int ret = 0;
> > bool locked = false;
> > + bool rotating;
> >
> > if (sb_rdonly(sb) && !fs_devices->seeding)
> > return -EROFS;
> > @@ -2621,8 +2657,12 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >
> > atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
> >
> > - if (!blk_queue_nonrot(q))
> > + rotating = !blk_queue_nonrot(q);
> > + device->rotating = rotating;
> > + if (rotating)
> > fs_devices->rotating = true;
> > + if (!fs_devices->mixed)
> > + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
> >
> > orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
> > btrfs_set_super_total_bytes(fs_info->super_copy,
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index 6e544317a377..594f1207281c 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -147,6 +147,9 @@ struct btrfs_device {
> > /* I/O stats for raid1 mirror selection */
> > struct percpu_counter inflight;
> > atomic_t last_offset;
> > +
> > + /* If the device is rotational */
> > + bool rotating;
> > };
> >
> > /*
> > @@ -274,6 +277,10 @@ struct btrfs_fs_devices {
> > * nonrot flag set
> > */
> > bool rotating;
> > + /* Set when we find or add both nonrot and rot disks in the
> > + * filesystem
> > + */
> > + bool mixed;
> >
> > struct btrfs_fs_info *fs_info;
> > /* sysfs kobjects */
> > --
> > 2.30.0
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”
On Wed, Feb 10, 2021 at 12:29:25PM +0000, Michal Rostecki wrote:
> On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
> > [...]
> > > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > > was making the performance even worse.
> > >
> > > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > > Not adding any value and increasing the value was making the performance
> > > worse.
> > >
> > > Adding penalty value to non-rotational disks was always decreasing the
> > > performance, which motivated setting it as 0 by default. For the purpose
> > > of testing, it's still configurable.
> > [...]
> > > + bdev = map->stripes[mirror_index].dev->bdev;
> > > + inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> > > + stripe_nr);
> > > + queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> > > +
> > > + return inflight < queue_depth;
> > [...]
> > > + last_mirror = this_cpu_read(*fs_info->last_mirror);
> > [...]
> > > + for (i = last_mirror; i < first + num_stripes; i++) {
> > > + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > > + stripe_nr)) {
> > > + preferred_mirror = i;
> > > + goto out;
> > > + }
> > > + }
> > > +
> > > + for (i = first; i < last_mirror; i++) {
> > > + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > > + stripe_nr)) {
> > > + preferred_mirror = i;
> > > + goto out;
> > > + }
> > > + }
> > > +
> > > + preferred_mirror = last_mirror;
> > > +
> > > +out:
> > > + this_cpu_write(*fs_info->last_mirror, preferred_mirror);
> >
> > This looks like it effectively decreases queue depth for non-last
> > device. After all devices are filled to queue_depth-penalty, only
> > a single mirror will be selected for next reads (until a read on
> > some other one completes).
> >
>
> Good point. And if all devices are going to be filled for longer time,
> this function will keep selecting the last one. Maybe I should select
> last+1 in that case. Would that address your concern or did you have any
> other solution in mind?
The best would be to postpone the selection until one device becomes free
again. But if that's not doable, then yes, you could make sure it stays
round-robin after filling the queues (the scheduling will lose the
"penalty"-driven adjustment though).
Best Regards,
Michał Mirosław
On Wed, Feb 10, 2021 at 12:18:53PM +0000, Michal Rostecki wrote:
> > These patches look good. But as only round-robin policy requires
> > to monitor the inflight and last-offset. Could you bring them under
> > if policy=roundrobin? Otherwise, it is just a waste of CPU cycles
> > if the policy != roundrobin.
> >
>
> If I bring those stats under if policy=roundrobin, they are going to be
> inaccurate if someone switches policies on the running system, after
> doing any I/O in that filesystem.
>
> I'm open to suggestions how can I make those stats as lightweight as
> possible. Unfortunately, I don't think I can store the last physical
> location without atomic_t.
>
> The BIO percpu counter is probably the least to be worried about, though
> I could maybe get rid of it entirely in favor of using part_stat_read().
>
Actually, after thinking about that more, I'm wondering if I should just
drop the last-offset stat and penalty mechanism entirely. They seem to
improve performance slightly only on mixed workloads (though I need to
check if that's the case in all-SSD or all-NVMe arrays), but still
perform worse than policies that you proposed.
Maybe it'd be better if I just focus on getting the best performance on
non-mixed environments in my policy and thus stick to the simple
`inflight < queue_depth` check...
Thanks,
Michal
On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> On Wed, Feb 10, 2021 at 12:29:25PM +0000, Michal Rostecki wrote:
> > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > This looks like it effectively decreases queue depth for non-last
> > > device. After all devices are filled to queue_depth-penalty, only
> > > a single mirror will be selected for next reads (until a read on
> > > some other one completes).
> > >
> >
> > Good point. And if all devices are going to be filled for longer time,
> > this function will keep selecting the last one. Maybe I should select
> > last+1 in that case. Would that address your concern or did you have any
> > other solution in mind?
>
> The best would be to postpone the selection until one device becomes free
> again. But if that's not doable, then yes, you could make sure it stays
> > round-robin after filling the queues (the scheduling will lose the
> "penalty"-driven adjustment though).
Or another idea - when all the queues are filled, return the mirror
which has the lowest load (inflight + penalty), even though it's greater
than queue depth. In that case the scheduling will not lose the penalty
adjustment and the load is going to be spread more fairly.
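A minimal sketch of that fallback, as illustrative userspace code: when every queue is at or over its depth, return the mirror with the lowest load (inflight + penalty) rather than a fixed one, keeping the penalty-driven adjustment.

```c
/* Fallback for the all-queues-full case: pick the mirror with the
 * lowest load value.  Ties go to the lowest index. */
static int pick_least_loaded(const int *load, int num_stripes)
{
	int best = 0;

	for (int i = 1; i < num_stripes; i++)
		if (load[i] < load[best])
			best = i;
	return best;
}
```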
I'm not sure if postponing the selection is that good idea. I think it's
better if the request is added to the iosched queue anyway, even if the
disks' queues are filled, and let the I/O scheduler handle that. The
length of the iosched queue (nr_requests, attribute of the iosched) is
usually greater than queue depth (attribute of the device), which means
that it's fine to schedule more requests for iosched to handle.
IMO btrfs should use the information given by iosched only for heuristic
mirror selection, rather than implement its own throttling logic.
Does it make sense to you?
Another idea could be an additional iteration in regard to
nr_requests, if all load values are greater than queue depths, though it
might be overkill. I would prefer to stick to my first idea if
everyone agrees.
Regards,
Michal
On Wed, Feb 10, 2021 at 07:23:04PM +0000, Michal Rostecki wrote:
> On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> > On Wed, Feb 10, 2021 at 12:29:25PM +0000, Michal Rostecki wrote:
> > > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > > This looks like it effectively decreases queue depth for non-last
> > > > device. After all devices are filled to queue_depth-penalty, only
> > > > a single mirror will be selected for next reads (until a read on
> > > > some other one completes).
> > > >
> > >
> > > Good point. And if all devices are going to be filled for longer time,
> > > this function will keep selecting the last one. Maybe I should select
> > > last+1 in that case. Would that address your concern or did you have any
> > > other solution in mind?
> >
> > The best would be to postpone the selection until one device becomes free
> > again. But if that's not doable, then yes, you could make sure it stays
> > round-robin after filling the queues (the scheduling will lose the
> > "penalty"-driven adjustment though).
>
> Or another idea - when all the queues are filled, return the mirror
> which has the lowest load (inflight + penalty), even though it's greater
> than queue depth. In that case the scheduling will not lose the penalty
> adjustment and the load is going to be spread more fairly.
>
> I'm not sure if postponing the selection is that good idea. I think it's
> better if the request is added to the iosched queue anyway, even if the
> disks' queues are filled, and let the I/O scheduler handle that. The
> length of the iosched queue (nr_requests, attribute of the iosched) is
> usually greater than queue depth (attribute of the device), which means
> that it's fine to schedule more requests for iosched to handle.
>
> IMO btrfs should use the information given by iosched only for heuristic
> mirror selection, rather than implement its own throttling logic.
>
> Does it make sense to you?
>
> Another idea could be an additional iteration in regard to
> nr_requests, if all load values are greater than queue depths, though it
> might be overkill. I would prefer to stick to my first idea if
> everyone agrees.
What if iosched could provide an estimate of request's latency? Then
btrfs could always select the lowest. For reads from NVME/SSD I would
normally expect something simple: speed_factor * (pending_bytes + req_bytes).
For HDDs this could do more computation like looking into what is there
in the queue already.
This would deviate from simple round-robin scheme, though.
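The simple estimate above can be modeled like this (speed_factor, its unit, and the function name are assumptions for illustration, not an existing iosched interface):

```c
#include <stdint.h>

/* Model of the latency estimate suggested above for NVMe/SSD reads:
 * speed_factor * (pending_bytes + req_bytes).  speed_factor is an
 * assumed per-device constant, e.g. nanoseconds per byte. */
static uint64_t estimate_latency(uint64_t speed_factor,
				 uint64_t pending_bytes, uint64_t req_bytes)
{
	return speed_factor * (pending_bytes + req_bytes);
}
```

With such estimates, a slower device with an empty queue can still beat a faster one with a long backlog, which is exactly the deviation from plain round-robin noted here.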
Best Regards,
Michał Mirosław
On Thu, Feb 11, 2021 at 03:27:38AM +0100, Michał Mirosław wrote:
> On Wed, Feb 10, 2021 at 07:23:04PM +0000, Michal Rostecki wrote:
> > On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> > > On Wed, Feb 10, 2021 at 12:29:25PM +0000, Michal Rostecki wrote:
> > > > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > > > This looks like it effectively decreases queue depth for non-last
> > > > > device. After all devices are filled to queue_depth-penalty, only
> > > > > a single mirror will be selected for next reads (until a read on
> > > > > some other one completes).
> > > > >
> > > >
> > > > Good point. And if all devices are going to be filled for longer time,
> > > > this function will keep selecting the last one. Maybe I should select
> > > > last+1 in that case. Would that address your concern or did you have any
> > > > other solution in mind?
> > >
> > > The best would be to postpone the selection until one device becomes free
> > > again. But if that's not doable, then yes, you could make sure it stays
> > > round-robin after filling the queues (the scheduling will lose the
> > > "penalty"-driven adjustment though).
> >
> > Or another idea - when all the queues are filled, return the mirror
> > which has the lowest load (inflight + penalty), even though it's greater
> > than queue depth. In that case the scheduling will not lose the penalty
> > adjustment and the load is going to be spread more fairly.
> >
> > I'm not sure if postponing the selection is that good idea. I think it's
> > better if the request is added to the iosched queue anyway, even if the
> > disks' queues are filled, and let the I/O scheduler handle that. The
> > length of the iosched queue (nr_requests, attribute of the iosched) is
> > usually greater than queue depth (attribute of the device), which means
> > that it's fine to schedule more requests for iosched to handle.
> >
> > IMO btrfs should use the information given by iosched only for heuristic
> > mirror selection, rather than implement its own throttling logic.
> >
> > Does it make sense to you?
> >
> > Another idea could be an additional iteration in regard to
> > nr_requests, if all load values are greater than queue depths, though it
> > might be overkill. I would prefer to stick to my first idea if
> > everyone agrees.
>
> What if iosched could provide an estimate of request's latency? Then
> btrfs could always select the lowest. For reads from NVME/SSD I would
> normally expect something simple: speed_factor * (pending_bytes + req_bytes).
> For HDDs this could do more computation like looking into what is there
> in the queue already.
>
> This would deviate from simple round-robin scheme, though.
I think that idea is close to what Anand Jain did in his latency
policy, though instead of trying to predict the latency of the currently
scheduled request, it uses stats about previous read operations:
https://patchwork.kernel.org/project/linux-btrfs/patch/63f6f00e2ecc741efd2200c3c87b5db52c6be2fd.1611114341.git.anand.jain@oracle.com/
Indeed your idea would deviate from the simple round-robin, so maybe it
would be good to try it, but as a separate policy in a separate
patch(set).
Regards,
Michal
On Wed, Feb 10, 2021 at 04:20:20PM +0800, Anand Jain wrote:
> On 10/02/2021 04:30, Michal Rostecki wrote:
> > The penalty value is an additional value added to the number of inflight
> > requests when a scheduled request is non-local (which means it would
> > start from the different physical location than the physical location of
> > the last request processed by the given device). By default, it's
> > applied only in filesystems which have mixed types of devices
> > (non-rotational and rotational), but it can be configured to be applied
> > without that condition.
> >
> > The configuration is done through sysfs:
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only
> >
> > where 1 (the default) value means applying penalty only in mixed arrays,
> > 0 means applying it unconditionally.
> >
> > The exact penalty value is defined separately for non-rotational and
> > rotational devices. By default, it's 0 for non-rotational devices and 1
> > for rotational devices. Both values are configurable through sysfs:
> >
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc
> >
> > To sum it up - the default case is applying the penalty under the
> > following conditions:
> >
> > - the raid1 array consists of mixed types of devices
> > - the scheduled request is going to be non-local for the given disk
> > - the device is rotational
> >
> > That default case is based on a slight preference towards non-rotational
> > disks in mixed arrays and has proven to give the best performance in
> > tested arrays.
> > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > was making the performance even worse.
> >
> > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > Not adding any value and increasing the value was making the performance
> > worse.
> > Adding penalty value to non-rotational disks was always decreasing the
> > performance, which motivated setting it as 0 by default. For the purpose
> > of testing, it's still configurable.
> >
> > To measure the performance of each policy and find optimal penalty
> > values, I created scripts which are available here:
> >
>
> So in summary
> rotational + non-rotational: penalty = 1
> all-rotational and homo : penalty = 0
> all-non-rotational and homo: penalty = 0
>
> I can't find any non-determinism in your findings above.
> It is not very clear to me if we need the configurable
> parameters here.
>
Honestly, the main reason why I made it configurable is to check the
performance of different values without editing and recompiling the
kernel. I was trying to find the best set of values with my simple Python
script which tries all values from 0 to 9 and runs fio. I left those
parameters to be configurable in my patches just in case someone else
would like to try to tune them on their environments.
The script is here:
https://github.com/mrostecki/btrfs-perf/blob/main/roundrobin-tune.py
But on the other hand, as I mentioned in the other mail - I'm getting
skeptical about having the whole penalty mechanism in general. As I
wrote and as you pointed, it improves the performance only for mixed
arrays. And since the roundrobin policy doesn't perform as well on mixed
arrays as the policies you proposed, but performs well on homogeneous arrays,
maybe it's better if I just focus on homogeneous case, and save some CPU
cycles by not storing physical locations.
> It is better to have random workloads in the above three categories
> of configs.
>
> Apart from the above three configs, there is also
> all-non-rotational with hetero
> For example, ssd and nvme together both are non-rotational.
> And,
> all-rotational with hetero
> For example, rotational disks with different speeds.
>
>
> The inflight calculation is local to btrfs. If the device is busy due to
> external factors, it would not switch to the better performing device.
>
Good point. Maybe I should try to use the part stats instead of storing
inflight locally in btrfs.
Hi Anand,
re: inflight calculation
On Thu, Feb 11, 2021 at 03:55:33PM +0000, Michal Rostecki wrote:
> > It is better to have random workloads in the above three categories
> > of configs.
> >
> > Apart from the above three configs, there is also
> > all-non-rotational with hetero
> > For example, ssd and nvme together both are non-rotational.
> > And,
> > all-rotational with hetero
> > For example, rotational disks with different speeds.
> >
> >
> > The inflight calculation is local to btrfs. If the device is busy due to
> > external factors, it would not switch to the better performing device.
> >
>
> Good point. Maybe I should try to use the part stats instead of storing
> inflight locally in btrfs.
I tried today to reuse the inflight calculation which is done for
iostat. And I came to conclusion that it's impossible without exporting
some methods from the block/ subsystem.
The thing is that there are two methods of calculating inflight. Which
one of them is used, depends on queue_is_mq():
https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/genhd.c#L1163
And if that condition is true, I noticed that part_stats return 0, even
though there are processed requests (I checked with fio inside a VM).
In general, those two methods of checking inflight are:
1) part_stats - which would be trivial to reuse, just a matter of one
simple for_each_possible_cpu() loop with part_stat_local_read_cpu()
https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/genhd.c#L133-L146
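A userspace model of that per-CPU summation, as a sketch (CPU count, names, and layout are illustrative; the real code uses part_stat_local_read_cpu() over for_each_possible_cpu()):

```c
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Per-CPU in-flight deltas may individually go negative, because a
 * request can complete on a different CPU than it was issued on; the
 * deltas are summed over all CPUs and the total is clamped at 0, as
 * the part_in_flight() helper in block/genhd.c does. */
struct percpu_inflight {
	int64_t in_flight[2]; /* [read, write] delta on this CPU */
};

static uint64_t part_in_flight_model(const struct percpu_inflight *pc)
{
	int64_t sum = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += pc[cpu].in_flight[0] + pc[cpu].in_flight[1];
	return sum < 0 ? 0 : (uint64_t)sum;
}
```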
2) blk_mq_in_flight() - which has a lot of code and unexported
structs inside the block/ directory, double function callback;
definitely not easy to reimplement in btrfs without copying
dozens of lines
https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/blk-mq.c#L115-L123
Well, I tried copying the whole blk_mq_in_flight() function with all
dependencies anyway; it's hard to do without causing modpost errors.
So, to sum it up, I think that making 2) possible to reuse in btrfs
would require at least exporting the blk_mq_in_flight() function,
therefore the change would have to go through linux-block tree. Which
maybe would be a good thing to do in long term, not sure if it should
block my patchset entirely.
The question is if we are fine with inflight stats inside btrfs.
Personally I think we shouldn't be concerned too much about it. The
inflight counter in my patches is a percpu counter, used in places which
already do some atomic operations.
Thanks,
Michal
On Wed, Feb 10, 2021 at 05:08:05AM +0100, Michał Mirosław wrote:
> On Tue, Feb 09, 2021 at 09:30:38PM +0100, Michal Rostecki wrote:
> > From: Michal Rostecki <[email protected]>
> >
> > Add the btrfs_check_mixed() function which checks if the filesystem has
> > the mixed type of devices (non-rotational and rotational). This
> > information is going to be used in roundrobin raid1 read policy.
> [...]
> > @@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> > }
> >
> > q = bdev_get_queue(bdev);
> > - if (!blk_queue_nonrot(q))
> > + rotating = !blk_queue_nonrot(q);
> > + device->rotating = rotating;
> > + if (rotating)
> > fs_devices->rotating = true;
> > + if (!fs_devices->mixed)
> > + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
> [...]
>
> Since this is adding to a set, a faster way is:
>
> if (fs_devices->rotating != rotating)
> fs_devices->mixed = true;
>
> The scan might be necessary on device removal, though.
>
Actually, that's not going to work in the case of appending a rotational
device when all previous devices are non-rotational.
if (rotating)
fs_devices->rotating = true;
if (fs_devices->rotating != rotating)
fs_devices->mixed = true;
If all devices are non-rotational, we start with the following
attributes:
fs_devices->rotating: false
fs_devices->mixed: false
Then, while appending a rotational disk, we have:
rotating = true;
if (rotating) // if (true)
fs_devices->rotating = true; // overriding with `true`
if (fs_devices->rotating != rotating) // if (true != true), which is false
fs_devices->mixed = true; // NOT EXECUTED
So we end up with fs_devices->mixed being `false`, despite having a mixed
array.
Inverting the order of those `if` checks would break the other
permutations which start with rotational disks.
Therefore, to cover all cases, I think we need a full check, always.
Regards,
Michal
On Fri, Feb 12, 2021 at 06:26:41PM +0000, Michal Rostecki wrote:
> On Wed, Feb 10, 2021 at 05:08:05AM +0100, Michał Mirosław wrote:
> > On Tue, Feb 09, 2021 at 09:30:38PM +0100, Michal Rostecki wrote:
> > > From: Michal Rostecki <[email protected]>
> > >
> > > Add the btrfs_check_mixed() function which checks if the filesystem has
> > > the mixed type of devices (non-rotational and rotational). This
> > > information is going to be used in the roundrobin raid1 read policy.
> > [...]
> > > @@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> > > }
> > >
> > > q = bdev_get_queue(bdev);
> > > - if (!blk_queue_nonrot(q))
> > > + rotating = !blk_queue_nonrot(q);
> > > + device->rotating = rotating;
> > > + if (rotating)
> > > fs_devices->rotating = true;
> > > + if (!fs_devices->mixed)
> > > + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
> > [...]
> >
> > Since this is adding to a set, a faster way is:
> >
> > if (fs_devices->rotating != rotating)
> > fs_devices->mixed = true;
> >
> > The scan might be necessary on device removal, though.
> Actually, that's not going to work in the case of appending a rotational
> device when all previous devices are non-rotational.
[...]
> Inverting the order of those `if` checks would break the other
> permutations which start with rotational disks.
But not if you would add:
if (adding first device)
fs_devices->rotating = rotating;
before the checks.
But then, there is a simpler way: count how many rotating vs non-rotating
devices there are while adding them. Like:
rotating ? ++n_rotating : ++n_fixed;
And then on remove you'd have it covered.
Best Regards
Michał Mirosław
On Tue, Feb 09, 2021 at 09:30:39PM +0100, Michal Rostecki wrote:
> From: Michal Rostecki <[email protected]>
>
> Before this change, raid1 read policy could be selected by using the
> /sys/fs/btrfs/[fsid]/read_policy file.
>
> Change it to /sys/fs/btrfs/[fsid]/read_policies/policy.
>
> The motivation behind creating the read_policies directory is that the
> next changes and new read policies are going to introduce settings
> specific to read policies.
No Documentation/ABI/ update for this change?
On Sat, Feb 13, 2021 at 11:19:28AM +0100, Greg KH wrote:
> On Tue, Feb 09, 2021 at 09:30:39PM +0100, Michal Rostecki wrote:
> > From: Michal Rostecki <[email protected]>
> >
> > Before this change, raid1 read policy could be selected by using the
> > /sys/fs/btrfs/[fsid]/read_policy file.
> >
> > Change it to /sys/fs/btrfs/[fsid]/read_policies/policy.
> >
> > The motivation behind creating the read_policies directory is that the
> > next changes and new read policies are going to introduce settings
> > specific to read policies.
>
> No Documentation/ABI/ update for this change?
>
Good point. As far as I see, we have no btrfs-related file there. I will
add it to Documentation/ABI/testing in v2.
I guess we also need to document all the options from btrfs that are
unaffected by this particular patchset? Would it make sense to cover
them in a separate patch?
On Sat, Feb 13, 2021 at 12:36:02AM +0100, Michał Mirosław wrote:
> On Fri, Feb 12, 2021 at 06:26:41PM +0000, Michal Rostecki wrote:
> > On Wed, Feb 10, 2021 at 05:08:05AM +0100, Michał Mirosław wrote:
> > > On Tue, Feb 09, 2021 at 09:30:38PM +0100, Michal Rostecki wrote:
> > > > From: Michal Rostecki <[email protected]>
> > > >
> > > > Add the btrfs_check_mixed() function which checks if the filesystem has
> > > > the mixed type of devices (non-rotational and rotational). This
> > > > information is going to be used in the roundrobin raid1 read policy.
> > > [...]
> > > > @@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> > > > }
> > > >
> > > > q = bdev_get_queue(bdev);
> > > > - if (!blk_queue_nonrot(q))
> > > > + rotating = !blk_queue_nonrot(q);
> > > > + device->rotating = rotating;
> > > > + if (rotating)
> > > > fs_devices->rotating = true;
> > > > + if (!fs_devices->mixed)
> > > > + fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
> > > [...]
> > >
> > > Since this is adding to a set, a faster way is:
> > >
> > > if (fs_devices->rotating != rotating)
> > > fs_devices->mixed = true;
> > >
> > > The scan might be necessary on device removal, though.
> > Actually, that's not going to work in the case of appending a rotational
> > device when all previous devices are non-rotational.
> [...]
> > Inverting the order of those `if` checks would break the other
> > permutations which start with rotational disks.
>
> But not if you would add:
>
> if (adding first device)
> fs_devices->rotating = rotating;
>
> before the checks.
>
> But then, there is a simpler way: count how many rotating vs non-rotating
> devices there are while adding them. Like:
>
> rotating ? ++n_rotating : ++n_fixed;
>
> And then on remove you'd have it covered.
I like the idea of storing numbers and simply checking them. I will use
it in v2, though probably in a different form, and I will most likely
move the whole logic of checking device types to separate functions, so
as not to bloat btrfs_open_one_device() and the others too much.
On Mon, Feb 15, 2021 at 02:35:35PM +0000, Michal Rostecki wrote:
> On Sat, Feb 13, 2021 at 11:19:28AM +0100, Greg KH wrote:
> > On Tue, Feb 09, 2021 at 09:30:39PM +0100, Michal Rostecki wrote:
> > > From: Michal Rostecki <[email protected]>
> > >
> > > Before this change, raid1 read policy could be selected by using the
> > > /sys/fs/btrfs/[fsid]/read_policy file.
> > >
> > > Change it to /sys/fs/btrfs/[fsid]/read_policies/policy.
> > >
> > > The motivation behing creating the read_policies directory is that the
> > > next changes and new read policies are going to intruduce settings
> > > specific to read policies.
> >
> > No Documentation/ABI/ update for this change?
> >
>
> Good point. As far as I see, we have no btrfs-related file there. I will
> add it to Documentation/ABI/testing in v2.
>
> I guess we also need to document all the options from btrfs that are
> unaffected by this particular patchset? Would it make sense to cover
> them in a separate patch?
That would be a good idea to do as a separate and new patch, yes.
thanks,
greg k-h