LinuxLists.cc - [PATCH md-6.9 v2 00/10] md/raid1: refactor read

2024-02-27 12:11:46

Subject: [PATCH md-6.9 v2 00/10] md/raid1: refactor read_balance() and some minor fix

From: Yu Kuai <[email protected]>

Changes in v2:
- add a new counter in conf for patch 2;
- fix the case of choosing the next idle disk when there is no other idle
disk, in patch 3;
- add Reviewed-by tags from Xiao Ni for patches 1, 4-8

The original idea was that Paul wanted to optimize raid1 read
performance ([1]). However, we think the original code for
read_balance() is quite complex, and we don't want to add more
complexity. Hence we decided to refactor read_balance() first, to make
the code cleaner and easier to follow up on.

Before this patchset, read_balance() has many local variables and many
branches, because it tries to handle all the scenarios in one iteration.
The idea of this patchset is to divide them into 4 different steps:

1) If resync is in progress, find the first usable disk, patch 5;
Otherwise:
2) Loop through all disks, skipping slow disks and disks with bad
blocks, and choose the best disk, patch 10. If no disk is found:
3) Look for disks with bad blocks and choose the one with the most
readable sectors, patch 8. If no disk is found:
4) Choose the first slow disk found with no bad blocks, or the slow disk
with the most readable sectors, patch 7.

Note that step 3) and step 4) are very cold code paths, so their
performance does not need to be considered.
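
As a rough illustration of the four steps above, the fallback order can be modelled in plain userspace C. This is only a sketch: struct step_results and pick_disk() are hypothetical names that do not exist in the kernel, and each field stands in for the result of one of the helpers introduced by this series (-1 meaning "no disk found").

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the result of each step; -1 means "no disk". */
struct step_results {
	bool resync_in_progress; /* condition checked in step 1 */
	int first_usable;        /* step 1: first usable disk */
	int best;                /* step 2: best healthy, non-slow disk */
	int bad_block;           /* step 3: disk with the longest readable range */
	int slow;                /* step 4: slow (write-mostly) disk */
};

/* Return the disk selected by the four-step fallback chain, or -1. */
static int pick_disk(const struct step_results *r)
{
	assert(r != 0);

	if (r->resync_in_progress)
		return r->first_usable;
	if (r->best >= 0)
		return r->best;
	if (r->bad_block >= 0)
		return r->bad_block;
	return r->slow;
}
```

Each later step only runs when the earlier one found nothing, which is why steps 3) and 4) are rarely reached.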

After this patchset, we'll continue to optimize read_balance() for
step 2), specifically how to choose the best rdev to read from.

[1] https://lore.kernel.org/all/[email protected]/

Yu Kuai (10):
md: add a new helper rdev_has_badblock()
md/raid1: record nonrot rdevs while adding/removing rdevs to conf
md/raid1: fix choose next idle in read_balance()
md/raid1-10: add a helper raid1_check_read_range()
md/raid1-10: factor out a new helper raid1_should_read_first()
md/raid1: factor out read_first_rdev() from read_balance()
md/raid1: factor out choose_slow_rdev() from read_balance()
md/raid1: factor out choose_bb_rdev() from read_balance()
md/raid1: factor out the code to manage sequential IO
md/raid1: factor out helpers to choose the best rdev from
read_balance()

drivers/md/md.h | 11 +
drivers/md/raid1-10.c | 69 +++++++
drivers/md/raid1.c | 465 +++++++++++++++++++++++++-----------------
drivers/md/raid1.h | 1 +
drivers/md/raid10.c | 58 ++----
drivers/md/raid5.c | 35 ++--
6 files changed, 391 insertions(+), 248 deletions(-)

--
2.39.2

2024-02-27 12:11:52

by Yu Kuai

Subject: [PATCH md-6.9 v2 09/10] md/raid1: factor out the code to manage sequential IO

From: Yu Kuai <[email protected]>

There is no functional change for now. This makes read_balance() cleaner
and prepares for fixing problems in, and refactoring, the handling of
sequential IO.
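
To make the threshold arithmetic in the two new helpers easier to follow, here is a userspace sketch of the same conditions. struct mirror_model, MAX_SECTOR and the model_* functions are hypothetical stand-ins for struct raid1_info, MaxSector and the helpers in the diff below; the rdev flag test and bdev_io_opt() are replaced by plain fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;
#define MAX_SECTOR (~(sector_t)0) /* stand-in for the kernel's MaxSector */

/* Hypothetical userspace model of the per-mirror bookkeeping. */
struct mirror_model {
	sector_t head_position; /* last known head position */
	sector_t next_seq_sect; /* sector expected next for a sequential read */
	sector_t seq_start;     /* where the current sequential run began */
	bool nonrot;            /* models the Nonrot flag */
	unsigned int io_opt;    /* optimal I/O size in bytes, 0 if unknown */
};

/* A read is sequential if it starts where the last one ended or where the head is. */
static bool model_is_sequential(const struct mirror_model *m, sector_t sector)
{
	return m->next_seq_sect == sector || m->head_position == sector;
}

/* True once a non-rotational disk has read at least io_opt sequentially. */
static bool model_should_choose_next(const struct mirror_model *m)
{
	sector_t opt_iosize = m->io_opt >> 9; /* bytes to 512-byte sectors */

	if (!m->nonrot)
		return false;

	return opt_iosize > 0 && m->seq_start != MAX_SECTOR &&
	       m->next_seq_sect > opt_iosize &&
	       m->next_seq_sect - opt_iosize >= m->seq_start;
}
```

For example, with io_opt of 512 KiB (1024 sectors) and a run that started at sector 1024, the model switches to preferring an idle disk once next_seq_sect reaches 2048.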

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1.c | 71 ++++++++++++++++++++++++----------------------
1 file changed, 37 insertions(+), 34 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 76bb59ad1485..d3e9a0157437 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -705,6 +705,31 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
return bb_disk;
}

+static bool is_sequential(struct r1conf *conf, int disk, struct r1bio *r1_bio)
+{
+ /* TODO: address issues with this check and concurrency. */
+ return conf->mirrors[disk].next_seq_sect == r1_bio->sector ||
+ conf->mirrors[disk].head_position == r1_bio->sector;
+}
+
+/*
+ * If buffered sequential IO size exceeds optimal iosize, check if there is idle
+ * disk. If yes, choose the idle disk.
+ */
+static bool should_choose_next(struct r1conf *conf, int disk)
+{
+ struct raid1_info *mirror = &conf->mirrors[disk];
+ int opt_iosize;
+
+ if (!test_bit(Nonrot, &mirror->rdev->flags))
+ return false;
+
+ opt_iosize = bdev_io_opt(mirror->rdev->bdev) >> 9;
+ return opt_iosize > 0 && mirror->seq_start != MaxSector &&
+ mirror->next_seq_sect > opt_iosize &&
+ mirror->next_seq_sect - opt_iosize >= mirror->seq_start;
+}
+
/*
* This routine returns the disk from which the requested read should
* be done. There is a per-array 'next expected sequential IO' sector
@@ -768,43 +793,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
pending = atomic_read(&rdev->nr_pending);
dist = abs(this_sector - conf->mirrors[disk].head_position);
/* Don't change to another disk for sequential reads */
- if (conf->mirrors[disk].next_seq_sect == this_sector
- || dist == 0) {
- int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
- struct raid1_info *mirror = &conf->mirrors[disk];
-
- /*
- * If buffered sequential IO size exceeds optimal
- * iosize, check if there is idle disk. If yes, choose
- * the idle disk. read_balance could already choose an
- * idle disk before noticing it's a sequential IO in
- * this disk. This doesn't matter because this disk
- * will idle, next time it will be utilized after the
- * first disk has IO size exceeds optimal iosize. In
- * this way, iosize of the first disk will be optimal
- * iosize at least. iosize of the second disk might be
- * small, but not a big deal since when the second disk
- * starts IO, the first disk is likely still busy.
- */
- if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
- mirror->seq_start != MaxSector &&
- mirror->next_seq_sect > opt_iosize &&
- mirror->next_seq_sect - opt_iosize >=
- mirror->seq_start) {
- /*
- * Add 'pending' to avoid choosing this disk if
- * there is other idle disk.
- */
- pending++;
- /*
- * If there is no other idle disk, this disk
- * will be chosen.
- */
- sequential_disk = disk;
- } else {
+ if (is_sequential(conf, disk, r1_bio)) {
+ if (!should_choose_next(conf, disk)) {
best_disk = disk;
break;
}
+ /*
+ * Add 'pending' to avoid choosing this disk if
+ * there is other idle disk.
+ */
+ pending++;
+ /*
+ * If there is no other idle disk, this disk
+ * will be chosen.
+ */
+ sequential_disk = disk;
}

if (min_pending > pending) {
--
2.39.2

2024-02-27 12:12:00

by Yu Kuai

Subject: [PATCH md-6.9 v2 07/10] md/raid1: factor out choose_slow_rdev() from read_balance()

From: Yu Kuai <[email protected]>

read_balance() is hard to understand because it has too many states
and branches, and it is overlong.

This patch factors out the case of reading from a slow rdev from
read_balance(). There are no functional changes.
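
The selection policy of the new helper can be sketched in userspace as follows. This is a simplified model, not the kernel code: model_choose_slow() is a hypothetical name, and readable[] is assumed to already hold, for each write-mostly disk, the number of sectors readable before the first bad block (what raid1_check_read_range() reports in the diff below).

```c
#include <assert.h>

/*
 * readable[i]: sectors readable from slow disk i before the first bad block
 *              (hypothetical precomputed input).
 * want:        length of the request in sectors.
 * Returns the chosen index or -1. Unlike the kernel helper, this model also
 * stores the usable length when a fully readable disk is found.
 */
static int model_choose_slow(const int *readable, int ndisks, int want,
			     int *max_sectors)
{
	int best = -1;
	int best_len = 0;

	assert(ndisks >= 0 && want > 0);

	for (int i = 0; i < ndisks; i++) {
		/* No bad blocks in range: take the first such disk. */
		if (readable[i] == want) {
			*max_sectors = want;
			return i;
		}
		/* Partial bad blocks: remember the longest readable prefix. */
		if (readable[i] > best_len) {
			best = i;
			best_len = readable[i];
		}
	}

	if (best != -1)
		*max_sectors = best_len;
	return best;
}
```

The early return is what distinguishes this helper from a plain "longest prefix" search: a slow disk without bad blocks wins immediately, without comparing it to the others.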

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
---
drivers/md/raid1.c | 69 ++++++++++++++++++++++++++++++++++------------
1 file changed, 52 insertions(+), 17 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3eeaef7f8ded..407e2bf5c322 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -620,6 +620,53 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
return -1;
}

+static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
+ int *max_sectors)
+{
+ sector_t this_sector = r1_bio->sector;
+ int bb_disk = -1;
+ int bb_read_len = 0;
+ int disk;
+
+ for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ struct md_rdev *rdev;
+ int len;
+ int read_len;
+
+ if (r1_bio->bios[disk] == IO_BLOCKED)
+ continue;
+
+ rdev = conf->mirrors[disk].rdev;
+ if (!rdev || test_bit(Faulty, &rdev->flags) ||
+ !test_bit(WriteMostly, &rdev->flags))
+ continue;
+
+ /* there are no bad blocks, we can use this disk */
+ len = r1_bio->sectors;
+ read_len = raid1_check_read_range(rdev, this_sector, &len);
+ if (read_len == r1_bio->sectors) {
+ update_read_sectors(conf, disk, this_sector, read_len);
+ return disk;
+ }
+
+ /*
+ * there are partial bad blocks, choose the rdev with largest
+ * read length.
+ */
+ if (read_len > bb_read_len) {
+ bb_disk = disk;
+ bb_read_len = read_len;
+ }
+ }
+
+ if (bb_disk != -1) {
+ *max_sectors = bb_read_len;
+ update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
+ }
+
+ return bb_disk;
+}
+
/*
* This routine returns the disk from which the requested read should
* be done. There is a per-array 'next expected sequential IO' sector
@@ -673,23 +720,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
if (!test_bit(In_sync, &rdev->flags) &&
rdev->recovery_offset < this_sector + sectors)
continue;
- if (test_bit(WriteMostly, &rdev->flags)) {
- /* Don't balance among write-mostly, just
- * use the first as a last resort */
- if (best_dist_disk < 0) {
- if (is_badblock(rdev, this_sector, sectors,
- &first_bad, &bad_sectors)) {
- if (first_bad <= this_sector)
- /* Cannot use this */
- continue;
- best_good_sectors = first_bad - this_sector;
- } else
- best_good_sectors = sectors;
- best_dist_disk = disk;
- best_pending_disk = disk;
- }
+ if (test_bit(WriteMostly, &rdev->flags))
continue;
- }
/* This is a reasonable device to use. It might
* even be best.
*/
@@ -808,7 +840,10 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
}
*max_sectors = sectors;

- return best_disk;
+ if (best_disk >= 0)
+ return best_disk;
+
+ return choose_slow_rdev(conf, r1_bio, max_sectors);
}

static void wake_up_barrier(struct r1conf *conf)
--
2.39.2

2024-02-27 12:12:03

by Yu Kuai

Subject: [PATCH md-6.9 v2 05/10] md/raid1-10: factor out a new helper raid1_should_read_first()

From: Yu Kuai <[email protected]>

If resync is in progress, read_balance() should find the first usable
disk; otherwise, data could be inconsistent after resync is done. raid1
and raid10 implement the same check, hence factor it out to make the
code cleaner.

Note that raid1 uses 'mddev->recovery_cp', which is updated after
all resync IO is done, while raid10 uses 'conf->next_resync', which
is inaccurate because raid10 updates it before submitting resync IO.
Fortunately, raid10 read IO can't run concurrently with resync IO, hence
there is no problem. This patch also switches raid10 to use
'mddev->recovery_cp'.
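
The check itself is a simple range comparison, sketched here in userspace. model_should_read_first() and MAX_SECTOR are hypothetical names; the cluster query made through md_cluster_ops is reduced to a boolean parameter, and recovery_cp is assumed to equal MAX_SECTOR when no resync is pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;
#define MAX_SECTOR (~(sector_t)0) /* stand-in for the kernel's MaxSector */

/*
 * Return true if the read [this_sector, this_sector + len) reaches past the
 * resync checkpoint, or if the cluster reports the area as resyncing. In
 * both cases the caller should take the first readable disk, not balance.
 */
static bool model_should_read_first(sector_t recovery_cp, sector_t this_sector,
				    int len, bool cluster_area_resyncing)
{
	assert(len > 0);

	if (recovery_cp < this_sector + (sector_t)len)
		return true;

	return cluster_area_resyncing;
}
```

A read that ends exactly at the checkpoint is still balanced; only a read whose end lies beyond recovery_cp is forced onto the first disk.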

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
---
drivers/md/raid1-10.c | 20 ++++++++++++++++++++
drivers/md/raid1.c | 15 ++-------------
drivers/md/raid10.c | 13 ++-----------
3 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 9bc0f0022a6c..2ea1710a3b70 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -276,3 +276,23 @@ static inline int raid1_check_read_range(struct md_rdev *rdev,
*len = first_bad + bad_sectors - this_sector;
return 0;
}
+
+/*
+ * Check if read should choose the first rdev.
+ *
+ * Balance on the whole device if no resync is going on (recovery is ok) or
+ * below the resync window. Otherwise, take the first readable disk.
+ */
+static inline bool raid1_should_read_first(struct mddev *mddev,
+ sector_t this_sector, int len)
+{
+ if ((mddev->recovery_cp < this_sector + len))
+ return true;
+
+ if (mddev_is_clustered(mddev) &&
+ md_cluster_ops->area_resyncing(mddev, READ, this_sector,
+ this_sector + len))
+ return true;
+
+ return false;
+}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fc5899fb08c1..640d5d8f789a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -605,11 +605,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
struct md_rdev *rdev;
int choose_first;

- /*
- * Check if we can balance. We can balance on the whole
- * device if no resync is going on, or below the resync window.
- * We take the first readable disk when above the resync window.
- */
retry:
sectors = r1_bio->sectors;
best_disk = -1;
@@ -619,16 +614,10 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
best_pending_disk = -1;
min_pending = UINT_MAX;
best_good_sectors = 0;
+ choose_first = raid1_should_read_first(conf->mddev, this_sector,
+ sectors);
clear_bit(R1BIO_FailFast, &r1_bio->state);

- if ((conf->mddev->recovery_cp < this_sector + sectors) ||
- (mddev_is_clustered(conf->mddev) &&
- md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector,
- this_sector + sectors)))
- choose_first = 1;
- else
- choose_first = 0;
-
for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
sector_t dist;
sector_t first_bad;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d5a7a621f0f0..8aecdb1ccc16 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -748,17 +748,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
best_good_sectors = 0;
do_balance = 1;
clear_bit(R10BIO_FailFast, &r10_bio->state);
- /*
- * Check if we can balance. We can balance on the whole
- * device if no resync is going on (recovery is ok), or below
- * the resync window. We take the first readable disk when
- * above the resync window.
- */
- if ((conf->mddev->recovery_cp < MaxSector
- && (this_sector + sectors >= conf->next_resync)) ||
- (mddev_is_clustered(conf->mddev) &&
- md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector,
- this_sector + sectors)))
+
+ if (raid1_should_read_first(conf->mddev, this_sector, sectors))
do_balance = 0;

for (slot = 0; slot < conf->copies ; slot++) {
--
2.39.2

2024-02-27 12:12:05

by Yu Kuai

Subject: [PATCH md-6.9 v2 10/10] md/raid1: factor out helpers to choose the best rdev from read_balance()

From: Yu Kuai <[email protected]>

The way the best rdev is chosen:

1) If the read is sequential from one rdev:
- if the rdev is rotational, use this rdev;
- if the rdev is non-rotational, use this rdev until the total read
length exceeds the disk's optimal IO size;

2) If the read is not sequential:
- if there is an idle disk, use it, otherwise:
- if the array has a non-rotational disk, choose the rdev with the
fewest inflight IOs;
- if all the underlying disks are rotational, choose the rdev
with the closest IO;

There are no functional changes; this just makes the code cleaner and
prepares for the following refactor.
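
Rule 2) above can be sketched as a small userspace function. struct disk_model and model_choose_nonseq() are hypothetical names; the model assumes every entry has already passed the readability checks and leaves out the sequential case of rule 1).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;
#define MAX_SECTOR (~(sector_t)0) /* stand-in for the kernel's MaxSector */

/* Hypothetical per-disk snapshot taken while scanning the mirrors. */
struct disk_model {
	unsigned int pending; /* inflight IOs on this disk */
	sector_t dist;        /* distance from head position to the request */
};

/* Pick a disk for a non-sequential read; returns an index or -1. */
static int model_choose_nonseq(const struct disk_model *d, int n,
			       bool array_has_nonrot)
{
	unsigned int min_pending = UINT_MAX;
	sector_t closest = MAX_SECTOR;
	int min_pending_disk = -1;
	int closest_disk = -1;

	assert(n >= 0);

	for (int i = 0; i < n; i++) {
		if (d[i].pending < min_pending) {
			min_pending = d[i].pending;
			min_pending_disk = i;
		}
		if (d[i].dist < closest) {
			closest = d[i].dist;
			closest_disk = i;
		}
	}

	/* An idle disk, or any non-rotational disk in the array: go by load. */
	if (min_pending_disk != -1 && (array_has_nonrot || min_pending == 0))
		return min_pending_disk;

	/* All rotational and all busy: minimize the seek distance. */
	return closest_disk;
}
```

Note that an idle disk wins even in an all-rotational array, which matches the "if there is an idle disk, use it" rule.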

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1.c | 175 +++++++++++++++++++++++++--------------------
1 file changed, 98 insertions(+), 77 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d3e9a0157437..1bdd59d9e6ba 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -730,74 +730,71 @@ static bool should_choose_next(struct r1conf *conf, int disk)
mirror->next_seq_sect - opt_iosize >= mirror->seq_start;
}

-/*
- * This routine returns the disk from which the requested read should
- * be done. There is a per-array 'next expected sequential IO' sector
- * number - if this matches on the next IO then we use the last disk.
- * There is also a per-disk 'last know head position' sector that is
- * maintained from IRQ contexts, both the normal and the resync IO
- * completion handlers update this position correctly. If there is no
- * perfect sequential match then we pick the disk whose head is closest.
- *
- * If there are 2 mirrors in the same 2 devices, performance degrades
- * because position is mirror, not device based.
- *
- * The rdev for the device selected will have nr_pending incremented.
- */
-static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors)
+static bool rdev_readable(struct md_rdev *rdev, struct r1bio *r1_bio)
{
- const sector_t this_sector = r1_bio->sector;
- int sectors;
- int best_good_sectors;
- int best_disk, best_dist_disk, best_pending_disk, sequential_disk;
- int disk;
- sector_t best_dist;
- unsigned int min_pending;
- struct md_rdev *rdev;
+ if (!rdev || test_bit(Faulty, &rdev->flags))
+ return false;

- retry:
- sectors = r1_bio->sectors;
- best_disk = -1;
- best_dist_disk = -1;
- sequential_disk = -1;
- best_dist = MaxSector;
- best_pending_disk = -1;
- min_pending = UINT_MAX;
- best_good_sectors = 0;
- clear_bit(R1BIO_FailFast, &r1_bio->state);
+ /* still in recovery */
+ if (!test_bit(In_sync, &rdev->flags) &&
+ rdev->recovery_offset < r1_bio->sector + r1_bio->sectors)
+ return false;

- if (raid1_should_read_first(conf->mddev, this_sector, sectors))
- return choose_first_rdev(conf, r1_bio, max_sectors);
+ /* don't read from slow disk unless have to */
+ if (test_bit(WriteMostly, &rdev->flags))
+ return false;
+
+ /* don't split IO for bad blocks unless have to */
+ if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors))
+ return false;
+
+ return true;
+}
+
+struct read_balance_ctl {
+ sector_t closest_dist;
+ int closest_dist_disk;
+ int min_pending;
+ int min_pending_disk;
+ int sequential_disk;
+ int readable_disks;
+};
+
+static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
+{
+ int disk;
+ struct read_balance_ctl ctl = {
+ .closest_dist_disk = -1,
+ .closest_dist = MaxSector,
+ .min_pending_disk = -1,
+ .min_pending = UINT_MAX,
+ .sequential_disk = -1,
+ };

for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ struct md_rdev *rdev;
sector_t dist;
unsigned int pending;

- rdev = conf->mirrors[disk].rdev;
- if (r1_bio->bios[disk] == IO_BLOCKED
- || rdev == NULL
- || test_bit(Faulty, &rdev->flags))
- continue;
- if (!test_bit(In_sync, &rdev->flags) &&
- rdev->recovery_offset < this_sector + sectors)
- continue;
- if (test_bit(WriteMostly, &rdev->flags))
+ if (r1_bio->bios[disk] == IO_BLOCKED)
continue;
- if (rdev_has_badblock(rdev, this_sector, sectors))
+
+ rdev = conf->mirrors[disk].rdev;
+ if (!rdev_readable(rdev, r1_bio))
continue;

- if (best_disk >= 0)
- /* At least two disks to choose from so failfast is OK */
+ /* At least two disks to choose from so failfast is OK */
+ if (ctl.readable_disks++ == 1)
set_bit(R1BIO_FailFast, &r1_bio->state);

pending = atomic_read(&rdev->nr_pending);
- dist = abs(this_sector - conf->mirrors[disk].head_position);
+ dist = abs(r1_bio->sector - conf->mirrors[disk].head_position);
+
/* Don't change to another disk for sequential reads */
if (is_sequential(conf, disk, r1_bio)) {
- if (!should_choose_next(conf, disk)) {
- best_disk = disk;
- break;
- }
+ if (!should_choose_next(conf, disk))
+ return disk;
+
/*
* Add 'pending' to avoid choosing this disk if
* there is other idle disk.
@@ -807,17 +804,17 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* If there is no other idle disk, this disk
* will be chosen.
*/
- sequential_disk = disk;
+ ctl.sequential_disk = disk;
}

- if (min_pending > pending) {
- min_pending = pending;
- best_pending_disk = disk;
+ if (ctl.min_pending > pending) {
+ ctl.min_pending = pending;
+ ctl.min_pending_disk = disk;
}

- if (dist < best_dist) {
- best_dist = dist;
- best_dist_disk = disk;
+ if (ctl.closest_dist > dist) {
+ ctl.closest_dist = dist;
+ ctl.closest_dist_disk = disk;
}
}

@@ -825,8 +822,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* sequential IO size exceeds optimal iosize, however, there is no other
* idle disk, so choose the sequential disk.
*/
- if (best_disk == -1 && min_pending != 0)
- best_disk = sequential_disk;
+ if (ctl.sequential_disk != -1 && ctl.min_pending != 0)
+ return ctl.sequential_disk;

/*
* If all disks are rotational, choose the closest disk. If any disk is
@@ -834,25 +831,49 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* disk is rotational, which might/might not be optimal for raids with
* mixed ratation/non-rotational disks depending on workload.
*/
- if (best_disk == -1) {
- if (READ_ONCE(conf->nonrot_disks) || min_pending == 0)
- best_disk = best_pending_disk;
- else
- best_disk = best_dist_disk;
- }
+ if (ctl.min_pending_disk != -1 &&
+ (READ_ONCE(conf->nonrot_disks) || ctl.min_pending == 0))
+ return ctl.min_pending_disk;
+ else
+ return ctl.closest_dist_disk;
+}

- if (best_disk >= 0) {
- rdev = conf->mirrors[best_disk].rdev;
- if (!rdev)
- goto retry;
+/*
+ * This routine returns the disk from which the requested read should be done.
+ *
+ * 1) If resync is in progress, find the first usable disk and use it even if it
+ * has some bad blocks.
+ *
+ * 2) Now that there is no resync, loop through all disks and skipping slow
+ * disks and disks with bad blocks for now. Only pay attention to key disk
+ * choice.
+ *
+ * 3) If we've made it this far, now look for disks with bad blocks and choose
+ * the one with most number of sectors.
+ *
+ * 4) If we are all the way at the end, we have no choice but to use a disk even
+ * if it is write mostly.
+ *
+ * The rdev for the device selected will have nr_pending incremented.
+ */
+static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
+ int *max_sectors)
+{
+ int disk;

- sectors = best_good_sectors;
- update_read_sectors(conf, disk, this_sector, sectors);
- }
- *max_sectors = sectors;
+ clear_bit(R1BIO_FailFast, &r1_bio->state);
+
+ if (raid1_should_read_first(conf->mddev, r1_bio->sector,
+ r1_bio->sectors))
+ return choose_first_rdev(conf, r1_bio, max_sectors);

- if (best_disk >= 0)
- return best_disk;
+ disk = choose_best_rdev(conf, r1_bio);
+ if (disk >= 0) {
+ *max_sectors = r1_bio->sectors;
+ update_read_sectors(conf, disk, r1_bio->sector,
+ r1_bio->sectors);
+ return disk;
+ }

/*
* If we are here it means we didn't find a perfectly good disk so
--
2.39.2

2024-02-27 12:16:44

by Yu Kuai

Subject: [PATCH md-6.9 v2 02/10] md/raid1: record nonrot rdevs while adding/removing rdevs to conf

From: Yu Kuai <[email protected]>

For raid1, each read iterates over all the rdevs in conf and checks
whether any rdev is non-rotational, then chooses the rdev with the fewest
inflight IOs if so, or the rdev with the closest distance otherwise.

Disk nonrot info can be changed through the sysfs entry:

/sys/block/[disk_name]/queue/rotational

However, this should only be used for testing, and users really
shouldn't do this in real life. Record the number of non-rotational
disks in conf, to avoid checking each rdev in the IO fast path and to
simplify read_balance() a little bit.
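
The bookkeeping can be sketched in userspace as below. struct rdev_model, struct conf_model and the model_* functions are hypothetical; the sketch omits the READ_ONCE()/WRITE_ONCE() annotations used in the diff and only shows why the per-rdev flag is sampled once at add time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace models of md_rdev and r1conf. */
struct rdev_model {
	bool queue_nonrot; /* current queue setting, may change via sysfs */
	bool nonrot_flag;  /* models the Nonrot bit, set when the disk is added */
};

struct conf_model {
	int nonrot_disks;  /* number of member disks flagged as Nonrot */
};

/* Sample the queue setting once and count the disk if it is non-rotational. */
static void model_add_disk(struct conf_model *conf, struct rdev_model *rdev)
{
	if (rdev->queue_nonrot) {
		rdev->nonrot_flag = true;
		conf->nonrot_disks++;
	}
}

/* Decrement based on the recorded flag, not on the current queue setting. */
static void model_remove_disk(struct conf_model *conf, struct rdev_model *rdev)
{
	if (rdev->nonrot_flag) {
		rdev->nonrot_flag = false;
		conf->nonrot_disks--;
		assert(conf->nonrot_disks >= 0);
	}
}
```

Because removal keys off the recorded flag, the counter stays balanced even if the rotational attribute is flipped through sysfs while the disk is a member.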

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.h | 1 +
drivers/md/raid1.c | 17 ++++++++++-------
drivers/md/raid1.h | 1 +
3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index a49ab04ab707..b2076a165c10 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -207,6 +207,7 @@ enum flag_bits {
* check if there is collision between raid1
* serial bios.
*/
+ Nonrot, /* non-rotational device (SSD) */
};

static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a145fe48b9ce..0fed01b06de9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -599,7 +599,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
int sectors;
int best_good_sectors;
int best_disk, best_dist_disk, best_pending_disk;
- int has_nonrot_disk;
int disk;
sector_t best_dist;
unsigned int min_pending;
@@ -620,7 +619,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
best_pending_disk = -1;
min_pending = UINT_MAX;
best_good_sectors = 0;
- has_nonrot_disk = 0;
choose_next_idle = 0;
clear_bit(R1BIO_FailFast, &r1_bio->state);

@@ -637,7 +635,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
sector_t first_bad;
int bad_sectors;
unsigned int pending;
- bool nonrot;

rdev = conf->mirrors[disk].rdev;
if (r1_bio->bios[disk] == IO_BLOCKED
@@ -703,8 +700,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
/* At least two disks to choose from so failfast is OK */
set_bit(R1BIO_FailFast, &r1_bio->state);

- nonrot = bdev_nonrot(rdev->bdev);
- has_nonrot_disk |= nonrot;
pending = atomic_read(&rdev->nr_pending);
dist = abs(this_sector - conf->mirrors[disk].head_position);
if (choose_first) {
@@ -731,7 +726,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* small, but not a big deal since when the second disk
* starts IO, the first disk is likely still busy.
*/
- if (nonrot && opt_iosize > 0 &&
+ if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
mirror->seq_start != MaxSector &&
mirror->next_seq_sect > opt_iosize &&
mirror->next_seq_sect - opt_iosize >=
@@ -763,7 +758,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* mixed ratation/non-rotational disks depending on workload.
*/
if (best_disk == -1) {
- if (has_nonrot_disk || min_pending == 0)
+ if (READ_ONCE(conf->nonrot_disks) || min_pending == 0)
best_disk = best_pending_disk;
else
best_disk = best_dist_disk;
@@ -1819,6 +1814,11 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
WRITE_ONCE(p[conf->raid_disks].rdev, rdev);
}

+ if (!err && bdev_nonrot(rdev->bdev)) {
+ set_bit(Nonrot, &rdev->flags);
+ WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks + 1);
+ }
+
print_conf(conf);
return err;
}
@@ -1883,6 +1883,9 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
}
abort:

+ if (test_and_clear_bit(Nonrot, &rdev->flags))
+ WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks - 1);
+
print_conf(conf);
return err;
}
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 14d4211a123a..5300cbaa58a4 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -71,6 +71,7 @@ struct r1conf {
* allow for replacements.
*/
int raid_disks;
+ int nonrot_disks;

spinlock_t device_lock;

--
2.39.2

2024-02-27 12:17:45

by Yu Kuai

Subject: [PATCH md-6.9 v2 08/10] md/raid1: factor out choose_bb_rdev() from read_balance()

From: Yu Kuai <[email protected]>

read_balance() is hard to understand because it has too many states
and branches, and it is overlong.

This patch factors out the case of reading from an rdev with bad blocks
from read_balance(). There are no functional changes.
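
As a simplified illustration of "the disk with the most readable sectors", the following userspace sketch assumes at most one bad range per disk. struct bb_model, model_readable_prefix() and model_choose_bb() are hypothetical names; the kernel helper queries the badblocks list through raid1_check_read_range() instead.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical model: one bad range per disk, bad_len == 0 means none. */
struct bb_model {
	sector_t first_bad;
	int bad_len;
};

/* Sectors readable from 'start' before hitting the bad range. */
static int model_readable_prefix(const struct bb_model *bb, sector_t start,
				 int len)
{
	sector_t end = start + (sector_t)len;

	/* No overlap between the request and the bad range. */
	if (bb->bad_len == 0 || bb->first_bad >= end ||
	    bb->first_bad + (sector_t)bb->bad_len <= start)
		return len;

	/* Bad range starts inside the request: read up to it. */
	if (bb->first_bad > start)
		return (int)(bb->first_bad - start);

	/* Bad range covers the first sector: nothing readable here. */
	return 0;
}

/* Choose the disk with the longest readable prefix; returns index or -1. */
static int model_choose_bb(const struct bb_model *bb, int n, sector_t start,
			   int len, int *max_sectors)
{
	int best_disk = -1;
	int best_len = 0;

	assert(n >= 0 && len > 0);

	for (int i = 0; i < n; i++) {
		int read_len = model_readable_prefix(&bb[i], start, len);

		if (read_len > best_len) {
			best_disk = i;
			best_len = read_len;
		}
	}

	if (best_disk != -1)
		*max_sectors = best_len;
	return best_disk;
}
```

A disk whose bad range covers the first requested sector contributes a length of 0 and is therefore never selected.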

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
---
drivers/md/raid1.c | 79 ++++++++++++++++++++++++++++------------------
1 file changed, 48 insertions(+), 31 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 407e2bf5c322..76bb59ad1485 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -620,6 +620,44 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
return -1;
}

+static int choose_bb_rdev(struct r1conf *conf, struct r1bio *r1_bio,
+ int *max_sectors)
+{
+ sector_t this_sector = r1_bio->sector;
+ int best_disk = -1;
+ int best_len = 0;
+ int disk;
+
+ for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ struct md_rdev *rdev;
+ int len;
+ int read_len;
+
+ if (r1_bio->bios[disk] == IO_BLOCKED)
+ continue;
+
+ rdev = conf->mirrors[disk].rdev;
+ if (!rdev || test_bit(Faulty, &rdev->flags) ||
+ test_bit(WriteMostly, &rdev->flags))
+ continue;
+
+ /* keep track of the disk with the most readable sectors. */
+ len = r1_bio->sectors;
+ read_len = raid1_check_read_range(rdev, this_sector, &len);
+ if (read_len > best_len) {
+ best_disk = disk;
+ best_len = read_len;
+ }
+ }
+
+ if (best_disk != -1) {
+ *max_sectors = best_len;
+ update_read_sectors(conf, best_disk, this_sector, best_len);
+ }
+
+ return best_disk;
+}
+
static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
int *max_sectors)
{
@@ -708,8 +746,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect

for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
sector_t dist;
- sector_t first_bad;
- int bad_sectors;
unsigned int pending;

rdev = conf->mirrors[disk].rdev;
@@ -722,36 +758,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
continue;
if (test_bit(WriteMostly, &rdev->flags))
continue;
- /* This is a reasonable device to use. It might
- * even be best.
- */
- if (is_badblock(rdev, this_sector, sectors,
- &first_bad, &bad_sectors)) {
- if (best_dist < MaxSector)
- /* already have a better device */
- continue;
- if (first_bad <= this_sector) {
- /* cannot read here. If this is the 'primary'
- * device, then we must not read beyond
- * bad_sectors from another device..
- */
- bad_sectors -= (this_sector - first_bad);
- if (best_good_sectors > sectors)
- best_good_sectors = sectors;
-
- } else {
- sector_t good_sectors = first_bad - this_sector;
- if (good_sectors > best_good_sectors) {
- best_good_sectors = good_sectors;
- best_disk = disk;
- }
- }
+ if (rdev_has_badblock(rdev, this_sector, sectors))
continue;
- } else {
- if ((sectors > best_good_sectors) && (best_disk >= 0))
- best_disk = -1;
- best_good_sectors = sectors;
- }

if (best_disk >= 0)
/* At least two disks to choose from so failfast is OK */
@@ -843,6 +851,15 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
if (best_disk >= 0)
return best_disk;

+ /*
+ * If we are here it means we didn't find a perfectly good disk so
+ * now spend a bit more time trying to find one with the most good
+ * sectors.
+ */
+ disk = choose_bb_rdev(conf, r1_bio, max_sectors);
+ if (disk >= 0)
+ return disk;
+
return choose_slow_rdev(conf, r1_bio, max_sectors);
}

--
2.39.2

2024-02-27 21:12:52

by Song Liu

Subject: Re: [PATCH md-6.9 v2 00/10] md/raid1: refactor read_balance() and some minor fix

On Tue, Feb 27, 2024 at 4:09 AM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> Changes in v2:
> - add a new counter in conf for patch 2;
> - fix the case of choosing the next idle disk when there is no other idle
> disk, in patch 3;
> - add Reviewed-by tags from Xiao Ni for patches 1, 4-8
>
> The original idea was that Paul wanted to optimize raid1 read
> performance ([1]). However, we think the original code for
> read_balance() is quite complex, and we don't want to add more
> complexity. Hence we decided to refactor read_balance() first, to make
> the code cleaner and easier to follow up on.
>
> Before this patchset, read_balance() has many local variables and many
> branches, because it tries to handle all the scenarios in one iteration.
> The idea of this patchset is to divide them into 4 different steps:
>
> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks, skipping slow disks and disks with bad
> blocks, and choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with the most
> readable sectors, patch 8. If no disk is found:
> 4) Choose the first slow disk found with no bad blocks, or the slow disk
> with the most readable sectors, patch 7.
>
> Note that step 3) and step 4) are very cold code paths, so their
> performance does not need to be considered.
>
> After this patchset, we'll continue to optimize read_balance() for
> step 2), specifically how to choose the best rdev to read from.
>
> [1] https://lore.kernel.org/all/[email protected]/

v2 looks good to me. Thanks! I will give Xiao some more time to review
it one more time before pushing it to md-6.9.

Song

2024-02-28 01:57:21

by Xiao Ni

Subject: Re: [PATCH md-6.9 v2 02/10] md/raid1: record nonrot rdevs while adding/removing rdevs to conf

On Tue, Feb 27, 2024 at 8:09 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> For raid1, each read iterates over all the rdevs in conf and checks
> whether any rdev is non-rotational, then chooses the rdev with the fewest
> inflight IOs if so, or the rdev with the closest distance otherwise.
>
> Disk nonrot info can be changed through the sysfs entry:
>
> /sys/block/[disk_name]/queue/rotational
>
> However, this should only be used for testing, and users really
> shouldn't do this in real life. Record the number of non-rotational
> disks in conf, to avoid checking each rdev in the IO fast path and to
> simplify read_balance() a little bit.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/md.h | 1 +
> drivers/md/raid1.c | 17 ++++++++++-------
> drivers/md/raid1.h | 1 +
> 3 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index a49ab04ab707..b2076a165c10 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -207,6 +207,7 @@ enum flag_bits {
> * check if there is collision between raid1
> * serial bios.
> */
> + Nonrot, /* non-rotational device (SSD) */
> };
>
> static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index a145fe48b9ce..0fed01b06de9 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -599,7 +599,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> int sectors;
> int best_good_sectors;
> int best_disk, best_dist_disk, best_pending_disk;
> - int has_nonrot_disk;
> int disk;
> sector_t best_dist;
> unsigned int min_pending;
> @@ -620,7 +619,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_pending_disk = -1;
> min_pending = UINT_MAX;
> best_good_sectors = 0;
> - has_nonrot_disk = 0;
> choose_next_idle = 0;
> clear_bit(R1BIO_FailFast, &r1_bio->state);
>
> @@ -637,7 +635,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> sector_t first_bad;
> int bad_sectors;
> unsigned int pending;
> - bool nonrot;
>
> rdev = conf->mirrors[disk].rdev;
> if (r1_bio->bios[disk] == IO_BLOCKED
> @@ -703,8 +700,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> /* At least two disks to choose from so failfast is OK */
> set_bit(R1BIO_FailFast, &r1_bio->state);
>
> - nonrot = bdev_nonrot(rdev->bdev);
> - has_nonrot_disk |= nonrot;
> pending = atomic_read(&rdev->nr_pending);
> dist = abs(this_sector - conf->mirrors[disk].head_position);
> if (choose_first) {
> @@ -731,7 +726,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * small, but not a big deal since when the second disk
> * starts IO, the first disk is likely still busy.
> */
> - if (nonrot && opt_iosize > 0 &&
> + if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
> mirror->seq_start != MaxSector &&
> mirror->next_seq_sect > opt_iosize &&
> mirror->next_seq_sect - opt_iosize >=
> @@ -763,7 +758,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * mixed ratation/non-rotational disks depending on workload.
> */
> if (best_disk == -1) {
> - if (has_nonrot_disk || min_pending == 0)
> + if (READ_ONCE(conf->nonrot_disks) || min_pending == 0)
> best_disk = best_pending_disk;
> else
> best_disk = best_dist_disk;
> @@ -1819,6 +1814,11 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> WRITE_ONCE(p[conf->raid_disks].rdev, rdev);
> }
>
> + if (!err && bdev_nonrot(rdev->bdev)) {
> + set_bit(Nonrot, &rdev->flags);
> + WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks + 1);
> + }
> +

Hi Kuai

I noticed that raid1_run->setup_conf is used to add rdevs to conf when
creating a raid1 array. raid1_add_disk is only used for --add/--re-add
after the array has been created. So do we need to add the same logic in
setup_conf?

Regards
Xiao
> print_conf(conf);
> return err;
> }
> @@ -1883,6 +1883,9 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
> }
> abort:
>
> + if (test_and_clear_bit(Nonrot, &rdev->flags))
> + WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks - 1);
> +
> print_conf(conf);
> return err;
> }
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index 14d4211a123a..5300cbaa58a4 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -71,6 +71,7 @@ struct r1conf {
> * allow for replacements.
> */
> int raid_disks;
> + int nonrot_disks;
>
> spinlock_t device_lock;
>
> --
> 2.39.2
>

2024-02-28 02:28:17

by Xiao Ni

Subject: Re: [PATCH md-6.9 v2 09/10] md/raid1: factor out the code to manage sequential IO

On Tue, Feb 27, 2024 at 8:09 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> There is no functional change for now, make read_balance() cleaner and
> prepare to fix problems and refactor the handler of sequential IO.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 71 ++++++++++++++++++++++++----------------------
> 1 file changed, 37 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 76bb59ad1485..d3e9a0157437 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -705,6 +705,31 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> return bb_disk;
> }
>
> +static bool is_sequential(struct r1conf *conf, int disk, struct r1bio *r1_bio)
> +{
> + /* TODO: address issues with this check and concurrency. */
> + return conf->mirrors[disk].next_seq_sect == r1_bio->sector ||
> + conf->mirrors[disk].head_position == r1_bio->sector;
> +}
> +
> +/*
> + * If buffered sequential IO size exceeds optimal iosize, check if there is idle
> + * disk. If yes, choose the idle disk.
> + */
> +static bool should_choose_next(struct r1conf *conf, int disk)
> +{
> + struct raid1_info *mirror = &conf->mirrors[disk];
> + int opt_iosize;
> +
> + if (!test_bit(Nonrot, &mirror->rdev->flags))
> + return false;
> +
> + opt_iosize = bdev_io_opt(mirror->rdev->bdev) >> 9;
> + return opt_iosize > 0 && mirror->seq_start != MaxSector &&
> + mirror->next_seq_sect > opt_iosize &&
> + mirror->next_seq_sect - opt_iosize >= mirror->seq_start;
> +}
> +
> /*
> * This routine returns the disk from which the requested read should
> * be done. There is a per-array 'next expected sequential IO' sector
> @@ -768,43 +793,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> pending = atomic_read(&rdev->nr_pending);
> dist = abs(this_sector - conf->mirrors[disk].head_position);
> /* Don't change to another disk for sequential reads */
> - if (conf->mirrors[disk].next_seq_sect == this_sector
> - || dist == 0) {
> - int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> - struct raid1_info *mirror = &conf->mirrors[disk];
> -
> - /*
> - * If buffered sequential IO size exceeds optimal
> - * iosize, check if there is idle disk. If yes, choose
> - * the idle disk. read_balance could already choose an
> - * idle disk before noticing it's a sequential IO in
> - * this disk. This doesn't matter because this disk
> - * will idle, next time it will be utilized after the
> - * first disk has IO size exceeds optimal iosize. In
> - * this way, iosize of the first disk will be optimal
> - * iosize at least. iosize of the second disk might be
> - * small, but not a big deal since when the second disk
> - * starts IO, the first disk is likely still busy.
> - */
> - if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
> - mirror->seq_start != MaxSector &&
> - mirror->next_seq_sect > opt_iosize &&
> - mirror->next_seq_sect - opt_iosize >=
> - mirror->seq_start) {
> - /*
> - * Add 'pending' to avoid choosing this disk if
> - * there is other idle disk.
> - */
> - pending++;
> - /*
> - * If there is no other idle disk, this disk
> - * will be chosen.
> - */
> - sequential_disk = disk;
> - } else {
> + if (is_sequential(conf, disk, r1_bio)) {
> + if (!should_choose_next(conf, disk)) {
> best_disk = disk;
> break;
> }
> + /*
> + * Add 'pending' to avoid choosing this disk if
> + * there is other idle disk.
> + */
> + pending++;
> + /*
> + * If there is no other idle disk, this disk
> + * will be chosen.
> + */
> + sequential_disk = disk;
> }
>
> if (min_pending > pending) {
> --
> 2.39.2
>
Hi all
This patch looks good to me.
Reviewed-by: Xiao Ni <[email protected]>
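
The two helpers factored out in this patch can be exercised outside the kernel with a small model. The sketch below is an assumption-laden illustration, not the kernel code: `struct mirror_model`, `model_is_sequential()`, `model_should_choose_next()` and `MODEL_MAX_SECTOR` are hypothetical names, the `nonrot` field stands in for `test_bit(Nonrot, &rdev->flags)`, and `io_opt_bytes` stands in for the value returned by `bdev_io_opt()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stands in for the kernel's MaxSector ("no sequential run recorded"). */
#define MODEL_MAX_SECTOR UINT64_MAX

/* Hypothetical userspace model of the per-mirror state used by the checks. */
struct mirror_model {
	uint64_t next_seq_sect;     /* sector expected next if the run continues */
	uint64_t head_position;     /* last known head position */
	uint64_t seq_start;         /* first sector of the current sequential run */
	bool nonrot;                /* models the Nonrot flag on the rdev */
	unsigned int io_opt_bytes;  /* models the optimal IO size in bytes */
};

/* A read is sequential if it starts where the last one ended or at the head. */
static bool model_is_sequential(const struct mirror_model *m, uint64_t sector)
{
	return m->next_seq_sect == sector || m->head_position == sector;
}

/*
 * On a non-rotational disk, once the sequential run has grown past the
 * optimal IO size, an idle disk (if there is one) should be preferred.
 */
static bool model_should_choose_next(const struct mirror_model *m)
{
	uint64_t opt_iosize;

	if (!m->nonrot)
		return false;

	opt_iosize = m->io_opt_bytes >> 9;  /* bytes to 512-byte sectors */
	return opt_iosize > 0 && m->seq_start != MODEL_MAX_SECTOR &&
	       m->next_seq_sect > opt_iosize &&
	       m->next_seq_sect - opt_iosize >= m->seq_start;
}
```

For example, with an optimal IO size of 512 KiB (1024 sectors), a run that started at sector 0 and has reached sector 2048 is long enough for the second check to return true, while a rotational disk always keeps the sequential stream.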

2024-02-28 02:38:02

by Xiao Ni

Subject: Re: [PATCH md-6.9 v2 10/10] md/raid1: factor out helpers to choose the best rdev from read_balance()

On Tue, Feb 27, 2024 at 8:10 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> The way that best rdev is chosen:
>
> 1) If the read is sequential from one rdev:
> - if rdev is rotational, use this rdev;
> - if rdev is non-rotational, use this rdev until total read length
> exceeds disk opt io size;
>
> 2) If the read is not sequential:
> - if there is idle disk, use it, otherwise:
> - if the array has non-rotational disk, choose the rdev with minimal
> inflight IO;
> - if all the underlying disks are rotational disks, choose the rdev
> with closest IO;
>
> There are no functional changes, just to make code cleaner and prepare
> for following refactor.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 175 +++++++++++++++++++++++++--------------------
> 1 file changed, 98 insertions(+), 77 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d3e9a0157437..1bdd59d9e6ba 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -730,74 +730,71 @@ static bool should_choose_next(struct r1conf *conf, int disk)
> mirror->next_seq_sect - opt_iosize >= mirror->seq_start;
> }
>
> -/*
> - * This routine returns the disk from which the requested read should
> - * be done. There is a per-array 'next expected sequential IO' sector
> - * number - if this matches on the next IO then we use the last disk.
> - * There is also a per-disk 'last know head position' sector that is
> - * maintained from IRQ contexts, both the normal and the resync IO
> - * completion handlers update this position correctly. If there is no
> - * perfect sequential match then we pick the disk whose head is closest.
> - *
> - * If there are 2 mirrors in the same 2 devices, performance degrades
> - * because position is mirror, not device based.
> - *
> - * The rdev for the device selected will have nr_pending incremented.
> - */
> -static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors)
> +static bool rdev_readable(struct md_rdev *rdev, struct r1bio *r1_bio)
> {
> - const sector_t this_sector = r1_bio->sector;
> - int sectors;
> - int best_good_sectors;
> - int best_disk, best_dist_disk, best_pending_disk, sequential_disk;
> - int disk;
> - sector_t best_dist;
> - unsigned int min_pending;
> - struct md_rdev *rdev;
> + if (!rdev || test_bit(Faulty, &rdev->flags))
> + return false;
>
> - retry:
> - sectors = r1_bio->sectors;
> - best_disk = -1;
> - best_dist_disk = -1;
> - sequential_disk = -1;
> - best_dist = MaxSector;
> - best_pending_disk = -1;
> - min_pending = UINT_MAX;
> - best_good_sectors = 0;
> - clear_bit(R1BIO_FailFast, &r1_bio->state);
> + /* still in recovery */
> + if (!test_bit(In_sync, &rdev->flags) &&
> + rdev->recovery_offset < r1_bio->sector + r1_bio->sectors)
> + return false;
>
> - if (raid1_should_read_first(conf->mddev, this_sector, sectors))
> - return choose_first_rdev(conf, r1_bio, max_sectors);
> + /* don't read from slow disk unless have to */
> + if (test_bit(WriteMostly, &rdev->flags))
> + return false;
> +
> + /* don't split IO for bad blocks unless have to */
> + if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors))
> + return false;
> +
> + return true;
> +}
> +
> +struct read_balance_ctl {
> + sector_t closest_dist;
> + int closest_dist_disk;
> + int min_pending;
> + int min_pending_disk;
> + int sequential_disk;
> + int readable_disks;
> +};
> +
> +static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
> +{
> + int disk;
> + struct read_balance_ctl ctl = {
> + .closest_dist_disk = -1,
> + .closest_dist = MaxSector,
> + .min_pending_disk = -1,
> + .min_pending = UINT_MAX,
> + .sequential_disk = -1,
> + };
>
> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> + struct md_rdev *rdev;
> sector_t dist;
> unsigned int pending;
>
> - rdev = conf->mirrors[disk].rdev;
> - if (r1_bio->bios[disk] == IO_BLOCKED
> - || rdev == NULL
> - || test_bit(Faulty, &rdev->flags))
> - continue;
> - if (!test_bit(In_sync, &rdev->flags) &&
> - rdev->recovery_offset < this_sector + sectors)
> - continue;
> - if (test_bit(WriteMostly, &rdev->flags))
> + if (r1_bio->bios[disk] == IO_BLOCKED)
> continue;
> - if (rdev_has_badblock(rdev, this_sector, sectors))
> +
> + rdev = conf->mirrors[disk].rdev;
> + if (!rdev_readable(rdev, r1_bio))
> continue;
>
> - if (best_disk >= 0)
> - /* At least two disks to choose from so failfast is OK */
> + /* At least two disks to choose from so failfast is OK */
> + if (ctl.readable_disks++ == 1)
> set_bit(R1BIO_FailFast, &r1_bio->state);
>
> pending = atomic_read(&rdev->nr_pending);
> - dist = abs(this_sector - conf->mirrors[disk].head_position);
> + dist = abs(r1_bio->sector - conf->mirrors[disk].head_position);
> +
> /* Don't change to another disk for sequential reads */
> if (is_sequential(conf, disk, r1_bio)) {
> - if (!should_choose_next(conf, disk)) {
> - best_disk = disk;
> - break;
> - }
> + if (!should_choose_next(conf, disk))
> + return disk;
> +
> /*
> * Add 'pending' to avoid choosing this disk if
> * there is other idle disk.
> @@ -807,17 +804,17 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * If there is no other idle disk, this disk
> * will be chosen.
> */
> - sequential_disk = disk;
> + ctl.sequential_disk = disk;
> }
>
> - if (min_pending > pending) {
> - min_pending = pending;
> - best_pending_disk = disk;
> + if (ctl.min_pending > pending) {
> + ctl.min_pending = pending;
> + ctl.min_pending_disk = disk;
> }
>
> - if (dist < best_dist) {
> - best_dist = dist;
> - best_dist_disk = disk;
> + if (ctl.closest_dist > dist) {
> + ctl.closest_dist = dist;
> + ctl.closest_dist_disk = disk;
> }
> }
>
> @@ -825,8 +822,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * sequential IO size exceeds optimal iosize, however, there is no other
> * idle disk, so choose the sequential disk.
> */
> - if (best_disk == -1 && min_pending != 0)
> - best_disk = sequential_disk;
> + if (ctl.sequential_disk != -1 && ctl.min_pending != 0)
> + return ctl.sequential_disk;
>
> /*
> * If all disks are rotational, choose the closest disk. If any disk is
> @@ -834,25 +831,49 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * disk is rotational, which might/might not be optimal for raids with
> * mixed ratation/non-rotational disks depending on workload.
> */
> - if (best_disk == -1) {
> - if (READ_ONCE(conf->nonrot_disks) || min_pending == 0)
> - best_disk = best_pending_disk;
> - else
> - best_disk = best_dist_disk;
> - }
> + if (ctl.min_pending_disk != -1 &&
> + (READ_ONCE(conf->nonrot_disks) || ctl.min_pending == 0))
> + return ctl.min_pending_disk;
> + else
> + return ctl.closest_dist_disk;
> +}
>
> - if (best_disk >= 0) {
> - rdev = conf->mirrors[best_disk].rdev;
> - if (!rdev)
> - goto retry;
> +/*
> + * This routine returns the disk from which the requested read should be done.
> + *
> + * 1) If resync is in progress, find the first usable disk and use it even if it
> + * has some bad blocks.
> + *
> + * 2) Now that there is no resync, loop through all disks and skipping slow
> + * disks and disks with bad blocks for now. Only pay attention to key disk
> + * choice.
> + *
> + * 3) If we've made it this far, now look for disks with bad blocks and choose
> + * the one with most number of sectors.
> + *
> + * 4) If we are all the way at the end, we have no choice but to use a disk even
> + * if it is write mostly.
> + *
> + * The rdev for the device selected will have nr_pending incremented.
> + */
> +static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
> + int *max_sectors)
> +{
> + int disk;
>
> - sectors = best_good_sectors;
> - update_read_sectors(conf, disk, this_sector, sectors);
> - }
> - *max_sectors = sectors;
> + clear_bit(R1BIO_FailFast, &r1_bio->state);
> +
> + if (raid1_should_read_first(conf->mddev, r1_bio->sector,
> + r1_bio->sectors))
> + return choose_first_rdev(conf, r1_bio, max_sectors);
>
> - if (best_disk >= 0)
> - return best_disk;
> + disk = choose_best_rdev(conf, r1_bio);
> + if (disk >= 0) {
> + *max_sectors = r1_bio->sectors;
> + update_read_sectors(conf, disk, r1_bio->sector,
> + r1_bio->sectors);
> + return disk;
> + }
>
> /*
> * If we are here it means we didn't find a perfectly good disk so
> --
> 2.39.2
>
Hi all
This patch looks good to me. Thanks.
Reviewed-by: Xiao Ni <[email protected]>
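
The non-sequential part of the selection described in this patch (an idle disk first, then least in-flight IO if the array has any non-rotational disk, otherwise the closest head) can be sketched as follows. This is a simplified userspace model under assumptions: `struct candidate_model` and `model_choose_best()` are hypothetical names, every array entry is assumed to have already passed the readability checks, and the sequential-read shortcut is left out.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical per-disk inputs; all entries are assumed to be readable. */
struct candidate_model {
	unsigned int pending;  /* models atomic_read(&rdev->nr_pending) */
	uint64_t dist;         /* models |request sector - head position| */
};

/*
 * Returns the index of the chosen disk, or -1 if there are no candidates.
 * nonrot_disks models conf->nonrot_disks, the number of SSDs in the array.
 */
static int model_choose_best(const struct candidate_model *c, int n,
			     int nonrot_disks)
{
	uint64_t closest_dist = UINT64_MAX;
	unsigned int min_pending = UINT_MAX;
	int closest_dist_disk = -1;
	int min_pending_disk = -1;
	int disk;

	for (disk = 0; disk < n; disk++) {
		/* Strict comparisons keep the lowest-numbered disk on ties. */
		if (min_pending > c[disk].pending) {
			min_pending = c[disk].pending;
			min_pending_disk = disk;
		}
		if (closest_dist > c[disk].dist) {
			closest_dist = c[disk].dist;
			closest_dist_disk = disk;
		}
	}

	/* An idle disk, or any SSD in the array: go by in-flight IO. */
	if (min_pending_disk != -1 && (nonrot_disks || min_pending == 0))
		return min_pending_disk;

	/* All rotational and all busy: go by seek distance. */
	return closest_dist_disk;
}
```

With two busy rotational disks, the model picks the one whose head is nearer even if it has more IO pending; setting `nonrot_disks` to a non-zero value flips the choice to the disk with less pending IO.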

2024-02-28 02:40:34

by Yu Kuai

Subject: Re: [PATCH md-6.9 v2 02/10] md/raid1: record nonrot rdevs while adding/removing rdevs to conf

Hi,

On 2024/02/28 9:56, Xiao Ni wrote:
> On Tue, Feb 27, 2024 at 8:09 PM Yu Kuai <[email protected]> wrote:
>>
>> From: Yu Kuai <[email protected]>
>>
>> For raid1, each read will iterate all the rdevs from conf and check if
>> any rdev is non-rotational, then choose rdev with minimal IO inflight
>> if so, or rdev with closest distance otherwise.
>>
>> Disk nonrot info can be changed through sysfs entry:
>>
>> /sys/block/[disk_name]/queue/rotational
>>
>> However, consider that this should only be used for testing, and user
>> really shouldn't do this in real life. Record the number of non-rotational
>> disks in conf, to avoid checking each rdev in IO fast path and simplify
>> read_balance() a little bit.
>>
>> Co-developed-by: Paul Luse <[email protected]>
>> Signed-off-by: Paul Luse <[email protected]>
>> Signed-off-by: Yu Kuai <[email protected]>
>> ---
>> drivers/md/md.h | 1 +
>> drivers/md/raid1.c | 17 ++++++++++-------
>> drivers/md/raid1.h | 1 +
>> 3 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index a49ab04ab707..b2076a165c10 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -207,6 +207,7 @@ enum flag_bits {
>> * check if there is collision between raid1
>> * serial bios.
>> */
>> + Nonrot, /* non-rotational device (SSD) */
>> };
>>
>> static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index a145fe48b9ce..0fed01b06de9 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -599,7 +599,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> int sectors;
>> int best_good_sectors;
>> int best_disk, best_dist_disk, best_pending_disk;
>> - int has_nonrot_disk;
>> int disk;
>> sector_t best_dist;
>> unsigned int min_pending;
>> @@ -620,7 +619,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> best_pending_disk = -1;
>> min_pending = UINT_MAX;
>> best_good_sectors = 0;
>> - has_nonrot_disk = 0;
>> choose_next_idle = 0;
>> clear_bit(R1BIO_FailFast, &r1_bio->state);
>>
>> @@ -637,7 +635,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> sector_t first_bad;
>> int bad_sectors;
>> unsigned int pending;
>> - bool nonrot;
>>
>> rdev = conf->mirrors[disk].rdev;
>> if (r1_bio->bios[disk] == IO_BLOCKED
>> @@ -703,8 +700,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> /* At least two disks to choose from so failfast is OK */
>> set_bit(R1BIO_FailFast, &r1_bio->state);
>>
>> - nonrot = bdev_nonrot(rdev->bdev);
>> - has_nonrot_disk |= nonrot;
>> pending = atomic_read(&rdev->nr_pending);
>> dist = abs(this_sector - conf->mirrors[disk].head_position);
>> if (choose_first) {
>> @@ -731,7 +726,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> * small, but not a big deal since when the second disk
>> * starts IO, the first disk is likely still busy.
>> */
>> - if (nonrot && opt_iosize > 0 &&
>> + if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
>> mirror->seq_start != MaxSector &&
>> mirror->next_seq_sect > opt_iosize &&
>> mirror->next_seq_sect - opt_iosize >=
>> @@ -763,7 +758,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> * mixed ratation/non-rotational disks depending on workload.
>> */
>> if (best_disk == -1) {
>> - if (has_nonrot_disk || min_pending == 0)
>> + if (READ_ONCE(conf->nonrot_disks) || min_pending == 0)
>> best_disk = best_pending_disk;
>> else
>> best_disk = best_dist_disk;
>> @@ -1819,6 +1814,11 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
>> WRITE_ONCE(p[conf->raid_disks].rdev, rdev);
>> }
>>
>> + if (!err && bdev_nonrot(rdev->bdev)) {
>> + set_bit(Nonrot, &rdev->flags);
>> + WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks + 1);
>> + }
>> +
>
> Hi Kuai
>
> I noticed that raid1_run->setup_conf is used to add rdevs to conf when
> creating a raid1 array. raid1_add_disk is only used for --add/--re-add
> after the array has been created. So do we need to add the same logic in
> setup_conf?

Yes, you're right. I'll add helpers raid1_add_conf(raid1_info, rdev) and
raid1_remove_conf(raid1_info, rdev) to do this, to make sure every place
that modifies conf is covered.

Thanks,
Kuai

>
> Regards
> Xiao
>> print_conf(conf);
>> return err;
>> }
>> @@ -1883,6 +1883,9 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>> }
>> abort:
>>
>> + if (test_and_clear_bit(Nonrot, &rdev->flags))
>> + WRITE_ONCE(conf->nonrot_disks, conf->nonrot_disks - 1);
>> +
>> print_conf(conf);
>> return err;
>> }
>> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
>> index 14d4211a123a..5300cbaa58a4 100644
>> --- a/drivers/md/raid1.h
>> +++ b/drivers/md/raid1.h
>> @@ -71,6 +71,7 @@ struct r1conf {
>> * allow for replacements.
>> */
>> int raid_disks;
>> + int nonrot_disks;
>>
>> spinlock_t device_lock;
>>
>> --
>> 2.39.2
>>
>
> .
>
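
The reply above proposes a pair of helpers so that every path that adds or removes an rdev keeps the non-rotational count consistent. Their bodies are not shown in this thread, so the following is only a userspace sketch of the bookkeeping idea: `struct rdev_model`, `struct conf_model`, `model_add_conf()` and `model_remove_conf()` are hypothetical names, `nonrot_hw` stands in for `bdev_nonrot(rdev->bdev)`, and `nonrot_flag` stands in for the `Nonrot` bit in `rdev->flags`.

```c
#include <stdbool.h>

/* Hypothetical model of an rdev: what the hardware reports and our flag. */
struct rdev_model {
	bool nonrot_hw;    /* models bdev_nonrot(rdev->bdev) */
	bool nonrot_flag;  /* models the Nonrot bit in rdev->flags */
};

/* Hypothetical model of the array configuration. */
struct conf_model {
	int nonrot_disks;  /* models conf->nonrot_disks */
};

/*
 * Called from every path that attaches an rdev (initial setup as well as a
 * later add). The flag makes a repeated call for the same rdev a no-op.
 */
static void model_add_conf(struct conf_model *conf, struct rdev_model *rdev)
{
	if (rdev->nonrot_hw && !rdev->nonrot_flag) {
		rdev->nonrot_flag = true;
		conf->nonrot_disks++;
	}
}

/*
 * Called from every path that detaches an rdev. Only rdevs that were
 * counted on the way in are subtracted on the way out.
 */
static void model_remove_conf(struct conf_model *conf, struct rdev_model *rdev)
{
	if (rdev->nonrot_flag) {
		rdev->nonrot_flag = false;
		conf->nonrot_disks--;
	}
}
```

Recording the decision in a per-rdev flag, as the patch does with `Nonrot`, means removal does not need to ask the block device again, so a later change through the sysfs `rotational` entry cannot unbalance the counter.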