From: Yu Kuai <[email protected]>
The original idea was that Paul wanted to optimize raid1 read
performance ([1]). However, we think that the original code for
read_balance() is quite complex, and we don't want to add more
complexity. Hence we decided to refactor read_balance() first, to make
the code cleaner and easier to follow up on.
Before this patchset, read_balance() had many local variables and many
branches; it tried to consider all the scenarios in one iteration. The
idea of this patchset is to divide the work into 4 different steps:
1) If resync is in progress, find the first usable disk, patch 5;
Otherwise:
2) Loop through all disks, skipping slow disks and disks with bad
blocks, and choose the best disk, patch 10. If no disk is found:
3) Look for disks with bad blocks and choose the one with the most
readable sectors, patch 8. If no disk is found:
4) Choose the first found slow disk with no bad blocks, or the slow disk
with the most readable sectors, patch 7.
Note that steps 3) and 4) are cold code paths, where performance is not
a concern.
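The four steps above can be sketched as a priority chain. This is a toy
model for illustration only: the disk fields and helper names below are
made up, whereas the real code operates on struct md_rdev, bad-block
ranges, and sector counts.

```c
#include <assert.h>
#include <stdbool.h>

struct disk {
	bool usable;           /* not faulty, in sync */
	bool slow;             /* WriteMostly */
	bool has_badblocks;
	int  readable_sectors; /* readable length at the request start */
};

/* Step 1: resync in progress -> first usable disk, bad blocks or not. */
static int choose_first(struct disk *d, int n)
{
	for (int i = 0; i < n; i++)
		if (d[i].usable && d[i].readable_sectors > 0)
			return i;
	return -1;
}

/* Step 2: a fully good disk (slow and bad-block disks skipped). */
static int choose_best(struct disk *d, int n)
{
	for (int i = 0; i < n; i++)
		if (d[i].usable && !d[i].slow && !d[i].has_badblocks)
			return i;
	return -1;
}

/* Step 3: among bad-block disks, the one with most readable sectors. */
static int choose_bb(struct disk *d, int n)
{
	int best = -1, best_len = 0;

	for (int i = 0; i < n; i++)
		if (d[i].usable && !d[i].slow && d[i].has_badblocks &&
		    d[i].readable_sectors > best_len) {
			best = i;
			best_len = d[i].readable_sectors;
		}
	return best;
}

/* Step 4: fall back to a slow disk without bad blocks. */
static int choose_slow(struct disk *d, int n)
{
	for (int i = 0; i < n; i++)
		if (d[i].usable && d[i].slow && !d[i].has_badblocks)
			return i;
	/* else: slow disk with most readable sectors (elided) */
	return -1;
}

static int read_balance(struct disk *d, int n, bool resync)
{
	int disk;

	if (resync)
		return choose_first(d, n);

	disk = choose_best(d, n);
	if (disk < 0)
		disk = choose_bb(d, n);
	if (disk < 0)
		disk = choose_slow(d, n);
	return disk;
}
```

The point of the split is that each fallback only runs when the previous
step found nothing, so the common case (step 2) stays small and hot.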
After this patchset, we'll continue to optimize read_balance() for
step 2), specifically how to choose the best rdev to read from.
[1] https://lore.kernel.org/all/[email protected]/
Yu Kuai (10):
md: add a new helper rdev_has_badblock()
md: record nonrot rdevs while adding/removing rdevs to conf
md/raid1: fix choose next idle in read_balance()
md/raid1-10: add a helper raid1_check_read_range()
md/raid1-10: factor out a new helper raid1_should_read_first()
md/raid1: factor out read_first_rdev() from read_balance()
md/raid1: factor out choose_slow_rdev() from read_balance()
md/raid1: factor out choose_bb_rdev() from read_balance()
md/raid1: factor out the code to manage sequential IO
md/raid1: factor out helpers to choose the best rdev from
read_balance()
drivers/md/md.c | 28 ++-
drivers/md/md.h | 12 ++
drivers/md/raid1-10.c | 69 +++++++
drivers/md/raid1.c | 454 ++++++++++++++++++++++++------------------
drivers/md/raid10.c | 66 ++----
drivers/md/raid5.c | 35 ++--
6 files changed, 402 insertions(+), 262 deletions(-)
--
2.39.2
From: Yu Kuai <[email protected]>
The current API is_badblock() requires 'first_bad' and 'bad_sectors'
output parameters; however, many callers just want to know whether there
are bad blocks or not, and these callers must define two local variables
that will never be used.
Add a new helper rdev_has_badblock() that only returns whether there are
bad blocks or not, remove the unnecessary local variables, and replace
is_badblock() with the new helper in many places.
There are no functional changes, and the new helper will also be used
later to refactor read_balance().
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.h | 10 ++++++++++
drivers/md/raid1.c | 26 +++++++-------------------
drivers/md/raid10.c | 45 ++++++++++++++-------------------------------
drivers/md/raid5.c | 35 +++++++++++++----------------------
4 files changed, 44 insertions(+), 72 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 8d881cc59799..a49ab04ab707 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -222,6 +222,16 @@ static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
}
return 0;
}
+
+static inline int rdev_has_badblock(struct md_rdev *rdev, sector_t s,
+ int sectors)
+{
+ sector_t first_bad;
+ int bad_sectors;
+
+ return is_badblock(rdev, s, sectors, &first_bad, &bad_sectors);
+}
+
extern int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
int is_new);
extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 286f8b16c7bd..a145fe48b9ce 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -498,9 +498,6 @@ static void raid1_end_write_request(struct bio *bio)
* to user-side. So if something waits for IO, then it
* will wait for the 'master' bio.
*/
- sector_t first_bad;
- int bad_sectors;
-
r1_bio->bios[mirror] = NULL;
to_put = bio;
/*
@@ -516,8 +513,8 @@ static void raid1_end_write_request(struct bio *bio)
set_bit(R1BIO_Uptodate, &r1_bio->state);
/* Maybe we can clear some bad blocks. */
- if (is_badblock(rdev, r1_bio->sector, r1_bio->sectors,
- &first_bad, &bad_sectors) && !discard_error) {
+ if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) &&
+ !discard_error) {
r1_bio->bios[mirror] = IO_MADE_GOOD;
set_bit(R1BIO_MadeGood, &r1_bio->state);
}
@@ -1944,8 +1941,6 @@ static void end_sync_write(struct bio *bio)
struct r1bio *r1_bio = get_resync_r1bio(bio);
struct mddev *mddev = r1_bio->mddev;
struct r1conf *conf = mddev->private;
- sector_t first_bad;
- int bad_sectors;
struct md_rdev *rdev = conf->mirrors[find_bio_disk(r1_bio, bio)].rdev;
if (!uptodate) {
@@ -1955,14 +1950,11 @@ static void end_sync_write(struct bio *bio)
set_bit(MD_RECOVERY_NEEDED, &
mddev->recovery);
set_bit(R1BIO_WriteError, &r1_bio->state);
- } else if (is_badblock(rdev, r1_bio->sector, r1_bio->sectors,
- &first_bad, &bad_sectors) &&
- !is_badblock(conf->mirrors[r1_bio->read_disk].rdev,
- r1_bio->sector,
- r1_bio->sectors,
- &first_bad, &bad_sectors)
- )
+ } else if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) &&
+ !rdev_has_badblock(conf->mirrors[r1_bio->read_disk].rdev,
+ r1_bio->sector, r1_bio->sectors)) {
set_bit(R1BIO_MadeGood, &r1_bio->state);
+ }
put_sync_write_buf(r1_bio, uptodate);
}
@@ -2279,16 +2271,12 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
s = PAGE_SIZE >> 9;
do {
- sector_t first_bad;
- int bad_sectors;
-
rdev = conf->mirrors[d].rdev;
if (rdev &&
(test_bit(In_sync, &rdev->flags) ||
(!test_bit(Faulty, &rdev->flags) &&
rdev->recovery_offset >= sect + s)) &&
- is_badblock(rdev, sect, s,
- &first_bad, &bad_sectors) == 0) {
+ rdev_has_badblock(rdev, sect, s) == 0) {
atomic_inc(&rdev->nr_pending);
if (sync_page_io(rdev, sect, s<<9,
conf->tmppage, REQ_OP_READ, false))
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 7412066ea22c..d5a7a621f0f0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -518,11 +518,7 @@ static void raid10_end_write_request(struct bio *bio)
* The 'master' represents the composite IO operation to
* user-side. So if something waits for IO, then it will
* wait for the 'master' bio.
- */
- sector_t first_bad;
- int bad_sectors;
-
- /*
+ *
* Do not set R10BIO_Uptodate if the current device is
* rebuilding or Faulty. This is because we cannot use
* such device for properly reading the data back (we could
@@ -535,10 +531,9 @@ static void raid10_end_write_request(struct bio *bio)
set_bit(R10BIO_Uptodate, &r10_bio->state);
/* Maybe we can clear some bad blocks. */
- if (is_badblock(rdev,
- r10_bio->devs[slot].addr,
- r10_bio->sectors,
- &first_bad, &bad_sectors) && !discard_error) {
+ if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr,
+ r10_bio->sectors) &&
+ !discard_error) {
bio_put(bio);
if (repl)
r10_bio->devs[slot].repl_bio = IO_MADE_GOOD;
@@ -1330,10 +1325,7 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio)
}
if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
- sector_t first_bad;
sector_t dev_sector = r10_bio->devs[i].addr;
- int bad_sectors;
- int is_bad;
/*
* Discard request doesn't care the write result
@@ -1342,9 +1334,8 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio)
if (!r10_bio->sectors)
continue;
- is_bad = is_badblock(rdev, dev_sector, r10_bio->sectors,
- &first_bad, &bad_sectors);
- if (is_bad < 0) {
+ if (rdev_has_badblock(rdev, dev_sector,
+ r10_bio->sectors) < 0) {
/*
* Mustn't write here until the bad block
* is acknowledged
@@ -2290,8 +2281,6 @@ static void end_sync_write(struct bio *bio)
struct mddev *mddev = r10_bio->mddev;
struct r10conf *conf = mddev->private;
int d;
- sector_t first_bad;
- int bad_sectors;
int slot;
int repl;
struct md_rdev *rdev = NULL;
@@ -2312,11 +2301,10 @@ static void end_sync_write(struct bio *bio)
&rdev->mddev->recovery);
set_bit(R10BIO_WriteError, &r10_bio->state);
}
- } else if (is_badblock(rdev,
- r10_bio->devs[slot].addr,
- r10_bio->sectors,
- &first_bad, &bad_sectors))
+ } else if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr,
+ r10_bio->sectors)) {
set_bit(R10BIO_MadeGood, &r10_bio->state);
+ }
rdev_dec_pending(rdev, mddev);
@@ -2597,11 +2585,8 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio)
static int r10_sync_page_io(struct md_rdev *rdev, sector_t sector,
int sectors, struct page *page, enum req_op op)
{
- sector_t first_bad;
- int bad_sectors;
-
- if (is_badblock(rdev, sector, sectors, &first_bad, &bad_sectors)
- && (op == REQ_OP_READ || test_bit(WriteErrorSeen, &rdev->flags)))
+ if (rdev_has_badblock(rdev, sector, sectors) &&
+ (op == REQ_OP_READ || test_bit(WriteErrorSeen, &rdev->flags)))
return -1;
if (sync_page_io(rdev, sector, sectors << 9, page, op, false))
/* success */
@@ -2658,16 +2643,14 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
s = PAGE_SIZE >> 9;
do {
- sector_t first_bad;
- int bad_sectors;
-
d = r10_bio->devs[sl].devnum;
rdev = conf->mirrors[d].rdev;
if (rdev &&
test_bit(In_sync, &rdev->flags) &&
!test_bit(Faulty, &rdev->flags) &&
- is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
- &first_bad, &bad_sectors) == 0) {
+ rdev_has_badblock(rdev,
+ r10_bio->devs[sl].addr + sect,
+ s) == 0) {
atomic_inc(&rdev->nr_pending);
success = sync_page_io(rdev,
r10_bio->devs[sl].addr +
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 14f2cf75abbd..9241e95ef55c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1210,10 +1210,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
*/
while (op_is_write(op) && rdev &&
test_bit(WriteErrorSeen, &rdev->flags)) {
- sector_t first_bad;
- int bad_sectors;
- int bad = is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf),
- &first_bad, &bad_sectors);
+ int bad = rdev_has_badblock(rdev, sh->sector,
+ RAID5_STRIPE_SECTORS(conf));
if (!bad)
break;
@@ -2855,8 +2853,6 @@ static void raid5_end_write_request(struct bio *bi)
struct r5conf *conf = sh->raid_conf;
int disks = sh->disks, i;
struct md_rdev *rdev;
- sector_t first_bad;
- int bad_sectors;
int replacement = 0;
for (i = 0 ; i < disks; i++) {
@@ -2888,9 +2884,8 @@ static void raid5_end_write_request(struct bio *bi)
if (replacement) {
if (bi->bi_status)
md_error(conf->mddev, rdev);
- else if (is_badblock(rdev, sh->sector,
- RAID5_STRIPE_SECTORS(conf),
- &first_bad, &bad_sectors))
+ else if (rdev_has_badblock(rdev, sh->sector,
+ RAID5_STRIPE_SECTORS(conf)))
set_bit(R5_MadeGoodRepl, &sh->dev[i].flags);
} else {
if (bi->bi_status) {
@@ -2900,9 +2895,8 @@ static void raid5_end_write_request(struct bio *bi)
if (!test_and_set_bit(WantReplacement, &rdev->flags))
set_bit(MD_RECOVERY_NEEDED,
&rdev->mddev->recovery);
- } else if (is_badblock(rdev, sh->sector,
- RAID5_STRIPE_SECTORS(conf),
- &first_bad, &bad_sectors)) {
+ } else if (rdev_has_badblock(rdev, sh->sector,
+ RAID5_STRIPE_SECTORS(conf))) {
set_bit(R5_MadeGood, &sh->dev[i].flags);
if (test_bit(R5_ReadError, &sh->dev[i].flags))
/* That was a successful write so make
@@ -4674,8 +4668,6 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
/* Now to look around and see what can be done */
for (i=disks; i--; ) {
struct md_rdev *rdev;
- sector_t first_bad;
- int bad_sectors;
int is_bad = 0;
dev = &sh->dev[i];
@@ -4719,8 +4711,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
rdev = conf->disks[i].replacement;
if (rdev && !test_bit(Faulty, &rdev->flags) &&
rdev->recovery_offset >= sh->sector + RAID5_STRIPE_SECTORS(conf) &&
- !is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf),
- &first_bad, &bad_sectors))
+ !rdev_has_badblock(rdev, sh->sector,
+ RAID5_STRIPE_SECTORS(conf)))
set_bit(R5_ReadRepl, &dev->flags);
else {
if (rdev && !test_bit(Faulty, &rdev->flags))
@@ -4733,8 +4725,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
if (rdev && test_bit(Faulty, &rdev->flags))
rdev = NULL;
if (rdev) {
- is_bad = is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf),
- &first_bad, &bad_sectors);
+ is_bad = rdev_has_badblock(rdev, sh->sector,
+ RAID5_STRIPE_SECTORS(conf));
if (s->blocked_rdev == NULL
&& (test_bit(Blocked, &rdev->flags)
|| is_bad < 0)) {
@@ -5463,8 +5455,8 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
struct r5conf *conf = mddev->private;
struct bio *align_bio;
struct md_rdev *rdev;
- sector_t sector, end_sector, first_bad;
- int bad_sectors, dd_idx;
+ sector_t sector, end_sector;
+ int dd_idx;
bool did_inc;
if (!in_chunk_boundary(mddev, raid_bio)) {
@@ -5493,8 +5485,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
atomic_inc(&rdev->nr_pending);
- if (is_badblock(rdev, sector, bio_sectors(raid_bio), &first_bad,
- &bad_sectors)) {
+ if (rdev_has_badblock(rdev, sector, bio_sectors(raid_bio))) {
rdev_dec_pending(rdev, mddev);
return 0;
}
--
2.39.2
From: Yu Kuai <[email protected]>
If resync is in progress, read_balance() should find the first usable
disk; otherwise, data could be inconsistent after resync is done. raid1
and raid10 implement the same check, hence factor it out to make the
code cleaner.
Note that raid1 uses 'mddev->recovery_cp', which is updated after all
resync IO is done, while raid10 uses 'conf->next_resync', which is
inaccurate because raid10 updates it before submitting resync IO.
Fortunately, raid10 read IO can't run concurrently with resync IO, hence
there is no problem. This patch also switches raid10 to use
'mddev->recovery_cp'.
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1-10.c | 20 ++++++++++++++++++++
drivers/md/raid1.c | 15 ++-------------
drivers/md/raid10.c | 13 ++-----------
3 files changed, 24 insertions(+), 24 deletions(-)
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 9bc0f0022a6c..2ea1710a3b70 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -276,3 +276,23 @@ static inline int raid1_check_read_range(struct md_rdev *rdev,
*len = first_bad + bad_sectors - this_sector;
return 0;
}
+
+/*
+ * Check if read should choose the first rdev.
+ *
+ * Balance on the whole device if no resync is going on (recovery is ok) or
+ * below the resync window. Otherwise, take the first readable disk.
+ */
+static inline bool raid1_should_read_first(struct mddev *mddev,
+ sector_t this_sector, int len)
+{
+	if (mddev->recovery_cp < this_sector + len)
+ return true;
+
+ if (mddev_is_clustered(mddev) &&
+ md_cluster_ops->area_resyncing(mddev, READ, this_sector,
+ this_sector + len))
+ return true;
+
+ return false;
+}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d0bc67e6d068..8089c569e84f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -605,11 +605,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
struct md_rdev *rdev;
int choose_first;
- /*
- * Check if we can balance. We can balance on the whole
- * device if no resync is going on, or below the resync window.
- * We take the first readable disk when above the resync window.
- */
retry:
sectors = r1_bio->sectors;
best_disk = -1;
@@ -618,16 +613,10 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
best_pending_disk = -1;
min_pending = UINT_MAX;
best_good_sectors = 0;
+ choose_first = raid1_should_read_first(conf->mddev, this_sector,
+ sectors);
clear_bit(R1BIO_FailFast, &r1_bio->state);
- if ((conf->mddev->recovery_cp < this_sector + sectors) ||
- (mddev_is_clustered(conf->mddev) &&
- md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector,
- this_sector + sectors)))
- choose_first = 1;
- else
- choose_first = 0;
-
for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
sector_t dist;
sector_t first_bad;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1f6693e40e12..22bcc3ce11d3 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -747,17 +747,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
best_good_sectors = 0;
do_balance = 1;
clear_bit(R10BIO_FailFast, &r10_bio->state);
- /*
- * Check if we can balance. We can balance on the whole
- * device if no resync is going on (recovery is ok), or below
- * the resync window. We take the first readable disk when
- * above the resync window.
- */
- if ((conf->mddev->recovery_cp < MaxSector
- && (this_sector + sectors >= conf->next_resync)) ||
- (mddev_is_clustered(conf->mddev) &&
- md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector,
- this_sector + sectors)))
+
+ if (raid1_should_read_first(conf->mddev, this_sector, sectors))
do_balance = 0;
for (slot = 0; slot < conf->copies ; slot++) {
--
2.39.2
From: Yu Kuai <[email protected]>
There are no functional changes for now; this makes read_balance()
cleaner and prepares for fixing problems and refactoring the handling of
sequential IO.
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1.c | 71 +++++++++++++++++++++++++---------------------
1 file changed, 38 insertions(+), 33 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4694e0e71e36..223ef8d06f67 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -705,6 +705,31 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
return bb_disk;
}
+static bool is_sequential(struct r1conf *conf, int disk, struct r1bio *r1_bio)
+{
+ /* TODO: address issues with this check and concurrency. */
+ return conf->mirrors[disk].next_seq_sect == r1_bio->sector ||
+ conf->mirrors[disk].head_position == r1_bio->sector;
+}
+
+/*
+ * If buffered sequential IO size exceeds optimal iosize, check if there is idle
+ * disk. If yes, choose the idle disk.
+ */
+static bool should_choose_next(struct r1conf *conf, int disk)
+{
+ struct raid1_info *mirror = &conf->mirrors[disk];
+ int opt_iosize;
+
+ if (!test_bit(Nonrot, &mirror->rdev->flags))
+ return false;
+
+ opt_iosize = bdev_io_opt(mirror->rdev->bdev) >> 9;
+ return opt_iosize > 0 && mirror->seq_start != MaxSector &&
+ mirror->next_seq_sect > opt_iosize &&
+ mirror->next_seq_sect - opt_iosize >= mirror->seq_start;
+}
+
/*
* This routine returns the disk from which the requested read should
* be done. There is a per-array 'next expected sequential IO' sector
@@ -767,42 +792,22 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
pending = atomic_read(&rdev->nr_pending);
dist = abs(this_sector - conf->mirrors[disk].head_position);
/* Don't change to another disk for sequential reads */
- if (conf->mirrors[disk].next_seq_sect == this_sector
- || dist == 0) {
- int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
- struct raid1_info *mirror = &conf->mirrors[disk];
-
- /*
- * If buffered sequential IO size exceeds optimal
- * iosize, check if there is idle disk. If yes, choose
- * the idle disk. read_balance could already choose an
- * idle disk before noticing it's a sequential IO in
- * this disk. This doesn't matter because this disk
- * will idle, next time it will be utilized after the
- * first disk has IO size exceeds optimal iosize. In
- * this way, iosize of the first disk will be optimal
- * iosize at least. iosize of the second disk might be
- * small, but not a big deal since when the second disk
- * starts IO, the first disk is likely still busy.
- */
- if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
- mirror->seq_start != MaxSector &&
- mirror->next_seq_sect > opt_iosize &&
- mirror->next_seq_sect - opt_iosize >=
- mirror->seq_start) {
- /*
- * Add 'pending' to avoid choosing this disk if
- * there is other idle disk.
- * Set 'dist' to 0, so that if there is no other
- * idle disk and all disks are rotational, this
- * disk will still be chosen.
- */
- pending++;
- dist = 0;
- } else {
+ if (is_sequential(conf, disk, r1_bio)) {
+ if (!should_choose_next(conf, disk)) {
best_disk = disk;
break;
}
+
+ /*
+ * Add 'pending' to avoid choosing this disk if there is
+ * other idle disk.
+ */
+ pending++;
+ /*
+ * Set 'dist' to 0, so that if there is no other idle
+ * disk, this disk will still be chosen.
+ */
+ dist = 0;
}
if (min_pending > pending) {
--
2.39.2
From: Yu Kuai <[email protected]>
read_balance() is hard to understand because it carries too much state
and too many branches, and it's overlong.
This patch factors the case of reading from an rdev with bad blocks out
of read_balance(); there are no functional changes.
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1.c | 79 ++++++++++++++++++++++++++++------------------
1 file changed, 48 insertions(+), 31 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index bc2f8fcbe5b3..4694e0e71e36 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -620,6 +620,44 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
return -1;
}
+static int choose_bb_rdev(struct r1conf *conf, struct r1bio *r1_bio,
+ int *max_sectors)
+{
+ sector_t this_sector = r1_bio->sector;
+ int best_disk = -1;
+ int best_len = 0;
+ int disk;
+
+ for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ struct md_rdev *rdev;
+ int len;
+ int read_len;
+
+ if (r1_bio->bios[disk] == IO_BLOCKED)
+ continue;
+
+ rdev = conf->mirrors[disk].rdev;
+ if (!rdev || test_bit(Faulty, &rdev->flags) ||
+ test_bit(WriteMostly, &rdev->flags))
+ continue;
+
+ /* keep track of the disk with the most readable sectors. */
+ len = r1_bio->sectors;
+ read_len = raid1_check_read_range(rdev, this_sector, &len);
+ if (read_len > best_len) {
+ best_disk = disk;
+ best_len = read_len;
+ }
+ }
+
+ if (best_disk != -1) {
+ *max_sectors = best_len;
+ update_read_sectors(conf, best_disk, this_sector, best_len);
+ }
+
+ return best_disk;
+}
+
static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
int *max_sectors)
{
@@ -707,8 +745,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
sector_t dist;
- sector_t first_bad;
- int bad_sectors;
unsigned int pending;
rdev = conf->mirrors[disk].rdev;
@@ -721,36 +757,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
continue;
if (test_bit(WriteMostly, &rdev->flags))
continue;
- /* This is a reasonable device to use. It might
- * even be best.
- */
- if (is_badblock(rdev, this_sector, sectors,
- &first_bad, &bad_sectors)) {
- if (best_dist < MaxSector)
- /* already have a better device */
- continue;
- if (first_bad <= this_sector) {
- /* cannot read here. If this is the 'primary'
- * device, then we must not read beyond
- * bad_sectors from another device..
- */
- bad_sectors -= (this_sector - first_bad);
- if (best_good_sectors > sectors)
- best_good_sectors = sectors;
-
- } else {
- sector_t good_sectors = first_bad - this_sector;
- if (good_sectors > best_good_sectors) {
- best_good_sectors = good_sectors;
- best_disk = disk;
- }
- }
+ if (rdev_has_badblock(rdev, this_sector, sectors))
continue;
- } else {
- if ((sectors > best_good_sectors) && (best_disk >= 0))
- best_disk = -1;
- best_good_sectors = sectors;
- }
if (best_disk >= 0)
/* At least two disks to choose from so failfast is OK */
@@ -834,6 +842,15 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
if (best_disk >= 0)
return best_disk;
+ /*
+ * If we are here it means we didn't find a perfectly good disk so
+ * now spend a bit more time trying to find one with the most good
+ * sectors.
+ */
+ disk = choose_bb_rdev(conf, r1_bio, max_sectors);
+ if (disk >= 0)
+ return disk;
+
return choose_slow_rdev(conf, r1_bio, max_sectors);
}
--
2.39.2
From: Yu Kuai <[email protected]>
The way the best rdev is chosen:
1) If the read is sequential from one rdev:
- if the rdev is rotational, use this rdev;
- if the rdev is non-rotational, use this rdev until the total read
length exceeds the disk's optimal IO size;
2) If the read is not sequential:
- if there is an idle disk, use it; otherwise:
- if the array has any non-rotational disk, choose the rdev with the
fewest inflight IOs;
- if all the underlying disks are rotational, choose the rdev with the
closest head position.
There are no functional changes, just to make the code cleaner and
prepare for the following refactoring.
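The choice rules above can be sketched roughly as follows. This is a
hedged, simplified model: the struct fields and the any_nonrot flag are
illustrative stand-ins, not the kernel's actual data structures.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct mirror {
	bool nonrot;          /* non-rotational (SSD) */
	bool sequential;      /* this read continues the last one */
	bool exceeded_opt;    /* sequential run already past optimal iosize */
	unsigned int pending; /* inflight IO count */
	unsigned int dist;    /* distance from last known head position */
};

static int choose_best(struct mirror *m, int n, bool any_nonrot)
{
	int min_pending_disk = -1, closest_disk = -1;
	unsigned int min_pending = UINT_MAX, closest = UINT_MAX;

	for (int i = 0; i < n; i++) {
		unsigned int pending = m[i].pending;
		unsigned int dist = m[i].dist;

		/*
		 * 1) Sequential read: stick with this disk, unless it is
		 * non-rotational and the run exceeded the optimal iosize.
		 */
		if (m[i].sequential) {
			if (!(m[i].nonrot && m[i].exceeded_opt))
				return i;
			/*
			 * Penalize 'pending' so an idle disk wins, but keep
			 * dist == 0 so this disk still wins if none is idle.
			 */
			pending++;
			dist = 0;
		}

		if (pending < min_pending) {
			min_pending = pending;
			min_pending_disk = i;
		}
		if (dist < closest) {
			closest = dist;
			closest_disk = i;
		}
	}

	/*
	 * 2) Non-sequential: prefer an idle disk; else fewest inflight IOs
	 * if any disk is non-rotational; else the closest head position.
	 */
	if (min_pending_disk != -1 && (any_nonrot || min_pending == 0))
		return min_pending_disk;
	return closest_disk;
}
```

Note the design choice: 'dist' only matters on all-rotational arrays,
where seek distance dominates latency; once any SSD is present, queue
depth is the better signal.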
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1.c | 171 ++++++++++++++++++++++++---------------------
1 file changed, 92 insertions(+), 79 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 223ef8d06f67..938b0e0170df 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -730,73 +730,68 @@ static bool should_choose_next(struct r1conf *conf, int disk)
mirror->next_seq_sect - opt_iosize >= mirror->seq_start;
}
-/*
- * This routine returns the disk from which the requested read should
- * be done. There is a per-array 'next expected sequential IO' sector
- * number - if this matches on the next IO then we use the last disk.
- * There is also a per-disk 'last know head position' sector that is
- * maintained from IRQ contexts, both the normal and the resync IO
- * completion handlers update this position correctly. If there is no
- * perfect sequential match then we pick the disk whose head is closest.
- *
- * If there are 2 mirrors in the same 2 devices, performance degrades
- * because position is mirror, not device based.
- *
- * The rdev for the device selected will have nr_pending incremented.
- */
-static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors)
+static bool rdev_readable(struct md_rdev *rdev, struct r1bio *r1_bio)
{
- const sector_t this_sector = r1_bio->sector;
- int sectors;
- int best_good_sectors;
- int best_disk, best_dist_disk, best_pending_disk;
- int disk;
- sector_t best_dist;
- unsigned int min_pending;
- struct md_rdev *rdev;
+ if (!rdev || test_bit(Faulty, &rdev->flags))
+ return false;
- retry:
- sectors = r1_bio->sectors;
- best_disk = -1;
- best_dist_disk = -1;
- best_dist = MaxSector;
- best_pending_disk = -1;
- min_pending = UINT_MAX;
- best_good_sectors = 0;
- clear_bit(R1BIO_FailFast, &r1_bio->state);
+ /* still in recovery */
+ if (!test_bit(In_sync, &rdev->flags) &&
+ rdev->recovery_offset < r1_bio->sector + r1_bio->sectors)
+ return false;
- if (raid1_should_read_first(conf->mddev, this_sector, sectors))
- return choose_first_rdev(conf, r1_bio, max_sectors);
+ /* don't read from slow disk unless have to */
+ if (test_bit(WriteMostly, &rdev->flags))
+ return false;
+
+ /* don't split IO for bad blocks unless have to */
+ if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors))
+ return false;
+
+ return true;
+}
+
+struct read_balance_ctl {
+ sector_t closest_dist;
+ int closest_dist_disk;
+ int min_pending;
+ int min_pending_disk;
+ int readable_disks;
+};
+
+static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
+{
+ int disk;
+ struct read_balance_ctl ctl = {
+ .closest_dist_disk = -1,
+ .closest_dist = MaxSector,
+ .min_pending_disk = -1,
+ .min_pending = UINT_MAX,
+ };
for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ struct md_rdev *rdev;
sector_t dist;
unsigned int pending;
- rdev = conf->mirrors[disk].rdev;
- if (r1_bio->bios[disk] == IO_BLOCKED
- || rdev == NULL
- || test_bit(Faulty, &rdev->flags))
- continue;
- if (!test_bit(In_sync, &rdev->flags) &&
- rdev->recovery_offset < this_sector + sectors)
- continue;
- if (test_bit(WriteMostly, &rdev->flags))
+ if (r1_bio->bios[disk] == IO_BLOCKED)
continue;
- if (rdev_has_badblock(rdev, this_sector, sectors))
+
+ rdev = conf->mirrors[disk].rdev;
+ if (!rdev_readable(rdev, r1_bio))
continue;
- if (best_disk >= 0)
- /* At least two disks to choose from so failfast is OK */
+ /* At least two disks to choose from so failfast is OK */
+ if (ctl.readable_disks++ == 1)
set_bit(R1BIO_FailFast, &r1_bio->state);
pending = atomic_read(&rdev->nr_pending);
- dist = abs(this_sector - conf->mirrors[disk].head_position);
+ dist = abs(r1_bio->sector - conf->mirrors[disk].head_position);
+
/* Don't change to another disk for sequential reads */
if (is_sequential(conf, disk, r1_bio)) {
- if (!should_choose_next(conf, disk)) {
- best_disk = disk;
- break;
- }
+ if (!should_choose_next(conf, disk))
+ return disk;
/*
* Add 'pending' to avoid choosing this disk if there is
@@ -810,42 +805,60 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
dist = 0;
}
- if (min_pending > pending) {
- min_pending = pending;
- best_pending_disk = disk;
+ if (ctl.min_pending > pending) {
+ ctl.min_pending = pending;
+ ctl.min_pending_disk = disk;
}
- if (dist < best_dist) {
- best_dist = dist;
- best_dist_disk = disk;
+ if (dist < ctl.closest_dist) {
+ ctl.closest_dist = dist;
+ ctl.closest_dist_disk = disk;
}
}
- /*
- * If all disks are rotational, choose the closest disk. If any disk is
- * non-rotational, choose the disk with less pending request even the
- * disk is rotational, which might/might not be optimal for raids with
- * mixed ratation/non-rotational disks depending on workload.
- */
- if (best_disk == -1) {
- if (conf->mddev->nonrot_disks || min_pending == 0)
- best_disk = best_pending_disk;
- else
- best_disk = best_dist_disk;
- }
- if (best_disk >= 0) {
- rdev = conf->mirrors[best_disk].rdev;
- if (!rdev)
- goto retry;
+ if (ctl.min_pending_disk != -1 &&
+ (conf->mddev->nonrot_disks || ctl.min_pending == 0))
+ return ctl.min_pending_disk;
+ else
+ return ctl.closest_dist_disk;
+}
- sectors = best_good_sectors;
- update_read_sectors(conf, disk, this_sector, sectors);
- }
- *max_sectors = sectors;
+/*
+ * This routine returns the disk from which the requested read should be done.
+ *
+ * 1) If resync is in progress, find the first usable disk and use
+ * it even if it has some bad blocks.
+ *
+ * 2) Now that there is no resync, loop through all disks, skipping slow
+ *    disks and disks with bad blocks for now; pay attention only to
+ *    choosing the best disk.
+ *
+ * 3) If we've made it this far, now look for disks with bad blocks and choose
+ *    the one with the most readable sectors.
+ *
+ * 4) If we are all the way at the end, we have no choice but to use a disk even
+ * if it is write mostly.
+ *
+ * The rdev for the device selected will have nr_pending incremented.
+ */
+static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors)
+{
+ int disk;
+
+ clear_bit(R1BIO_FailFast, &r1_bio->state);
- if (best_disk >= 0)
- return best_disk;
+ if (raid1_should_read_first(conf->mddev, r1_bio->sector,
+ r1_bio->sectors))
+ return choose_first_rdev(conf, r1_bio, max_sectors);
+
+ disk = choose_best_rdev(conf, r1_bio);
+ if (disk >= 0) {
+ *max_sectors = r1_bio->sectors;
+ update_read_sectors(conf, disk, r1_bio->sector,
+ r1_bio->sectors);
+ return disk;
+ }
/*
* If we are here it means we didn't find a perfectly good disk so
--
2.39.2
From: Yu Kuai <[email protected]>
read_balance() is hard to understand because it carries too much state
and too many branches, and it's overlong.
This patch factors the case of reading from the first rdev out of
read_balance(); there are no functional changes.
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1.c | 63 +++++++++++++++++++++++++++++++++-------------
1 file changed, 46 insertions(+), 17 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8089c569e84f..08c45ca55a7e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -579,6 +579,47 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector,
return len;
}
+static void update_read_sectors(struct r1conf *conf, int disk,
+ sector_t this_sector, int len)
+{
+ struct raid1_info *info = &conf->mirrors[disk];
+
+ atomic_inc(&info->rdev->nr_pending);
+ if (info->next_seq_sect != this_sector)
+ info->seq_start = this_sector;
+ info->next_seq_sect = this_sector + len;
+}
+
+static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
+ int *max_sectors)
+{
+ sector_t this_sector = r1_bio->sector;
+ int len = r1_bio->sectors;
+ int disk;
+
+ for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ struct md_rdev *rdev;
+ int read_len;
+
+ if (r1_bio->bios[disk] == IO_BLOCKED)
+ continue;
+
+ rdev = conf->mirrors[disk].rdev;
+ if (!rdev || test_bit(Faulty, &rdev->flags))
+ continue;
+
+ /* choose the first disk even if it has some bad blocks. */
+ read_len = raid1_check_read_range(rdev, this_sector, &len);
+ if (read_len > 0) {
+ update_read_sectors(conf, disk, this_sector, read_len);
+ *max_sectors = read_len;
+ return disk;
+ }
+ }
+
+ return -1;
+}
+
/*
* This routine returns the disk from which the requested read should
* be done. There is a per-array 'next expected sequential IO' sector
@@ -603,7 +644,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
sector_t best_dist;
unsigned int min_pending;
struct md_rdev *rdev;
- int choose_first;
retry:
sectors = r1_bio->sectors;
@@ -613,10 +653,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
best_pending_disk = -1;
min_pending = UINT_MAX;
best_good_sectors = 0;
- choose_first = raid1_should_read_first(conf->mddev, this_sector,
- sectors);
clear_bit(R1BIO_FailFast, &r1_bio->state);
+ if (raid1_should_read_first(conf->mddev, this_sector, sectors))
+ return choose_first_rdev(conf, r1_bio, max_sectors);
+
for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
sector_t dist;
sector_t first_bad;
@@ -662,8 +703,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* bad_sectors from another device..
*/
bad_sectors -= (this_sector - first_bad);
- if (choose_first && sectors > bad_sectors)
- sectors = bad_sectors;
if (best_good_sectors > sectors)
best_good_sectors = sectors;
@@ -673,8 +712,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
best_good_sectors = good_sectors;
best_disk = disk;
}
- if (choose_first)
- break;
}
continue;
} else {
@@ -689,10 +726,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
pending = atomic_read(&rdev->nr_pending);
dist = abs(this_sector - conf->mirrors[disk].head_position);
- if (choose_first) {
- best_disk = disk;
- break;
- }
/* Don't change to another disk for sequential reads */
if (conf->mirrors[disk].next_seq_sect == this_sector
|| dist == 0) {
@@ -760,13 +793,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
rdev = conf->mirrors[best_disk].rdev;
if (!rdev)
goto retry;
- atomic_inc(&rdev->nr_pending);
- sectors = best_good_sectors;
-
- if (conf->mirrors[best_disk].next_seq_sect != this_sector)
- conf->mirrors[best_disk].seq_start = this_sector;
- conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
+ sectors = best_good_sectors;
+ update_read_sectors(conf, best_disk, this_sector, sectors);
}
*max_sectors = sectors;
--
2.39.2
From: Yu Kuai <[email protected]>
Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added
the case 'choose next idle' in read_balance():
read_balance:
for_each_rdev
if (next_seq_sect == this_sector || dist == 0)
-> sequential reads
best_disk = disk;
if (...)
choose_next_idle = 1
continue;
for_each_rdev
-> iterate next rdev
if (pending == 0)
best_disk = disk;
-> choose the next idle disk
break;
if (choose_next_idle)
-> keep using this rdev if there are no other idle disk
continue
However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
removed the code:
- /* If device is idle, use it */
- if (pending == 0) {
- best_disk = disk;
- break;
- }
Hence 'choose next idle' will never work now. Fix this problem by the
following:
1) don't set best_disk in this case; read_balance() will choose the best
disk after iterating all the disks;
2) add 'pending' so that another idle disk will be chosen;
3) set 'dist' to 0 so that if there is no other idle disk, and all disks
are rotational, this disk will still be chosen;
Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c60ea58ae8c5..d0bc67e6d068 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
unsigned int min_pending;
struct md_rdev *rdev;
int choose_first;
- int choose_next_idle;
/*
* Check if we can balance. We can balance on the whole
@@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
best_pending_disk = -1;
min_pending = UINT_MAX;
best_good_sectors = 0;
- choose_next_idle = 0;
clear_bit(R1BIO_FailFast, &r1_bio->state);
if ((conf->mddev->recovery_cp < this_sector + sectors) ||
@@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
struct raid1_info *mirror = &conf->mirrors[disk];
- best_disk = disk;
/*
* If buffered sequential IO size exceeds optimal
* iosize, check if there is idle disk. If yes, choose
@@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
mirror->next_seq_sect > opt_iosize &&
mirror->next_seq_sect - opt_iosize >=
mirror->seq_start) {
- choose_next_idle = 1;
- continue;
+ /*
+ * Add 'pending' to avoid choosing this disk if
+ * there is other idle disk.
+ * Set 'dist' to 0, so that if there is no other
+ * idle disk and all disks are rotational, this
+ * disk will still be chosen.
+ */
+ pending++;
+ dist = 0;
+ } else {
+ best_disk = disk;
+ break;
}
- break;
}
- if (choose_next_idle)
- continue;
-
if (min_pending > pending) {
min_pending = pending;
best_pending_disk = disk;
--
2.39.2
From: Yu Kuai <[email protected]>
For raid1/raid10, each read will iterate all the rdevs from conf and
check if any rdev is non-rotational, then choose the rdev with minimal
inflight IO if so, or the rdev with the closest head distance otherwise.
Disk nonrot info can be changed through the sysfs entry:
/sys/block/[disk_name]/queue/rotational
However, this should only be used for testing, and users really
shouldn't change it in real life. Record the number of non-rotational
disks in mddev, to avoid checking each rdev in the IO fast path and to
simplify read_balance() a little bit.
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 28 ++++++++++++++++++++++++++--
drivers/md/md.h | 2 ++
drivers/md/raid1.c | 9 ++-------
drivers/md/raid10.c | 8 ++------
4 files changed, 32 insertions(+), 15 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e2a5f513dbb7..9e671eec9309 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -146,6 +146,24 @@ static inline int speed_max(struct mddev *mddev)
mddev->sync_speed_max : sysctl_speed_limit_max;
}
+static void rdev_update_nonrot(struct md_rdev *rdev)
+{
+ if (!bdev_nonrot(rdev->bdev))
+ return;
+
+ set_bit(Nonrot, &rdev->flags);
+ WRITE_ONCE(rdev->mddev->nonrot_disks, rdev->mddev->nonrot_disks + 1);
+}
+
+static void rdev_clear_nonrot(struct md_rdev *rdev)
+{
+ if (!test_bit(Nonrot, &rdev->flags))
+ return;
+
+ clear_bit(Nonrot, &rdev->flags);
+ WRITE_ONCE(rdev->mddev->nonrot_disks, rdev->mddev->nonrot_disks - 1);
+}
+
static void rdev_uninit_serial(struct md_rdev *rdev)
{
if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
@@ -2922,6 +2940,8 @@ static int add_bound_rdev(struct md_rdev *rdev)
md_kick_rdev_from_array(rdev);
return err;
}
+
+ rdev_update_nonrot(rdev);
}
sysfs_notify_dirent_safe(rdev->sysfs_state);
@@ -3271,8 +3291,10 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len)
if (err) {
rdev->raid_disk = -1;
return err;
- } else
- sysfs_notify_dirent_safe(rdev->sysfs_state);
+ }
+
+ rdev_update_nonrot(rdev);
+ sysfs_notify_dirent_safe(rdev->sysfs_state);
/* failure here is OK */;
sysfs_link_rdev(rdev->mddev, rdev);
/* don't wakeup anyone, leave that to userspace. */
@@ -9266,6 +9288,7 @@ static int remove_and_add_spares(struct mddev *mddev,
rdev_for_each(rdev, mddev) {
if ((this == NULL || rdev == this) && rdev_removeable(rdev) &&
!mddev->pers->hot_remove_disk(mddev, rdev)) {
+ rdev_clear_nonrot(rdev);
sysfs_unlink_rdev(mddev, rdev);
rdev->saved_raid_disk = rdev->raid_disk;
rdev->raid_disk = -1;
@@ -9289,6 +9312,7 @@ static int remove_and_add_spares(struct mddev *mddev,
if (!test_bit(Journal, &rdev->flags))
rdev->recovery_offset = 0;
if (mddev->pers->hot_add_disk(mddev, rdev) == 0) {
+ rdev_update_nonrot(rdev);
/* failure here is OK */
sysfs_link_rdev(mddev, rdev);
if (!test_bit(Journal, &rdev->flags))
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a49ab04ab707..54aa951f2bba 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -207,6 +207,7 @@ enum flag_bits {
* check if there is collision between raid1
* serial bios.
*/
+ Nonrot, /* non-rotational device (SSD) */
};
static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
@@ -312,6 +313,7 @@ struct mddev {
unsigned long flags;
unsigned long sb_flags;
+ int nonrot_disks;
int suspended;
struct mutex suspend_mutex;
struct percpu_ref active_io;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a145fe48b9ce..c60ea58ae8c5 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -599,7 +599,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
int sectors;
int best_good_sectors;
int best_disk, best_dist_disk, best_pending_disk;
- int has_nonrot_disk;
int disk;
sector_t best_dist;
unsigned int min_pending;
@@ -620,7 +619,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
best_pending_disk = -1;
min_pending = UINT_MAX;
best_good_sectors = 0;
- has_nonrot_disk = 0;
choose_next_idle = 0;
clear_bit(R1BIO_FailFast, &r1_bio->state);
@@ -637,7 +635,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
sector_t first_bad;
int bad_sectors;
unsigned int pending;
- bool nonrot;
rdev = conf->mirrors[disk].rdev;
if (r1_bio->bios[disk] == IO_BLOCKED
@@ -703,8 +700,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
/* At least two disks to choose from so failfast is OK */
set_bit(R1BIO_FailFast, &r1_bio->state);
- nonrot = bdev_nonrot(rdev->bdev);
- has_nonrot_disk |= nonrot;
pending = atomic_read(&rdev->nr_pending);
dist = abs(this_sector - conf->mirrors[disk].head_position);
if (choose_first) {
@@ -731,7 +726,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* small, but not a big deal since when the second disk
* starts IO, the first disk is likely still busy.
*/
- if (nonrot && opt_iosize > 0 &&
+ if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
mirror->seq_start != MaxSector &&
mirror->next_seq_sect > opt_iosize &&
mirror->next_seq_sect - opt_iosize >=
@@ -763,7 +758,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* mixed ratation/non-rotational disks depending on workload.
*/
if (best_disk == -1) {
- if (has_nonrot_disk || min_pending == 0)
+ if (conf->mddev->nonrot_disks || min_pending == 0)
best_disk = best_pending_disk;
else
best_disk = best_dist_disk;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d5a7a621f0f0..1f6693e40e12 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -735,7 +735,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
struct md_rdev *best_dist_rdev, *best_pending_rdev, *rdev = NULL;
int do_balance;
int best_dist_slot, best_pending_slot;
- bool has_nonrot_disk = false;
unsigned int min_pending;
struct geom *geo = &conf->geo;
@@ -766,7 +765,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
int bad_sectors;
sector_t dev_sector;
unsigned int pending;
- bool nonrot;
if (r10_bio->devs[slot].bio == IO_BLOCKED)
continue;
@@ -818,10 +816,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
if (!do_balance)
break;
- nonrot = bdev_nonrot(rdev->bdev);
- has_nonrot_disk |= nonrot;
pending = atomic_read(&rdev->nr_pending);
- if (min_pending > pending && nonrot) {
+ if (min_pending > pending && test_bit(Nonrot, &rdev->flags)) {
min_pending = pending;
best_pending_slot = slot;
best_pending_rdev = rdev;
@@ -851,7 +847,7 @@ static struct md_rdev *read_balance(struct r10conf *conf,
}
}
if (slot >= conf->copies) {
- if (has_nonrot_disk) {
+ if (conf->mddev->nonrot_disks) {
slot = best_pending_slot;
rdev = best_pending_rdev;
} else {
--
2.39.2
Dear Kuai, dear Paul,
Thank you for your work. Some nits and request for benchmarks below.
On 22.02.24 08:57, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> The orignial idea is that Paul want to optimize raid1 read
original
> performance([1]), however, we think that the orignial code for
original
> read_balance() is quite complex, and we don't want to add more
> complexity. Hence we decide to refactor read_balance() first, to make
> code cleaner and easier for follow up.
>
> Before this patchset, read_balance() has many local variables and many
> braches, it want to consider all the scenarios in one iteration. The
branches
> idea of this patch is to devide them into 4 different steps:
divide
> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks and skipping slow disks and disks with bad
> blocks, choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with most number of
> sectors, patch 8. If no disk is found:
> 4) Choose first found slow disk with no bad blocks, or slow disk with
> most number of sectors, patch 7.
>
> Note that step 3) and step 4) are super code path, and performance
> should not be considered.
>
> And after this patchset, we'll continue to optimize read_balance for
> step 2), specifically how to choose the best rdev to read.
Is there a change in performance with the current patch set? Is raid1
well enough covered by the test suite?
Kind regards,
Paul
> [1] https://lore.kernel.org/all/[email protected]/
>
> Yu Kuai (10):
> md: add a new helper rdev_has_badblock()
> md: record nonrot rdevs while adding/removing rdevs to conf
> md/raid1: fix choose next idle in read_balance()
> md/raid1-10: add a helper raid1_check_read_range()
> md/raid1-10: factor out a new helper raid1_should_read_first()
> md/raid1: factor out read_first_rdev() from read_balance()
> md/raid1: factor out choose_slow_rdev() from read_balance()
> md/raid1: factor out choose_bb_rdev() from read_balance()
> md/raid1: factor out the code to manage sequential IO
> md/raid1: factor out helpers to choose the best rdev from
> read_balance()
>
> drivers/md/md.c | 28 ++-
> drivers/md/md.h | 12 ++
> drivers/md/raid1-10.c | 69 +++++++
> drivers/md/raid1.c | 454 ++++++++++++++++++++++++------------------
> drivers/md/raid10.c | 66 ++----
> drivers/md/raid5.c | 35 ++--
> 6 files changed, 402 insertions(+), 262 deletions(-)
Hi,
On 2024/02/22 16:40, Paul Menzel wrote:
> Is there a change in performance with the current patch set? Is radi1
> well enough covered by the test suite?
Yes, there is no performance degradation, and the mdadm tests passed.
Paul Luse also ran a mixed fio workload with data integrity checks, and
it passes.
Thanks,
Kuai
> On Feb 22, 2024, at 2:08 AM, Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2024/02/22 16:40, Paul Menzel wrote:
>> Is there a change in performance with the current patch set? Is radi1 well enough covered by the test suite?
>
> Yes, there are no performance degradation, and mdadm tests passed. And
> Paul Luse also ran fio mixed workload w/data integrity and it passes.
>
> Thanks,
> Kuai
>
Kuai is correct. In my original perf improvement patch I included lots of results. For this set, where we just refactored, I checked performance to assure we didn't go downhill, but didn't save the results as the deltas were in the noise. After this series lands we will look at introducing performance improvements again, and at that time results from a full performance sweep will be included.
For data integrity, 1- and 2-disk mirrors were run overnight w/fio and a crc32 check - no issues.
To assure other code paths execute as they did before was a little trickier without a unit test framework, but for those cases I did modify/un-modify the code several times to follow various code paths and assure they're working as expected (ie bad blocks, etc).
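For reference, a data-integrity run of this kind could use an fio job along these lines (illustrative only; the actual job file, device name, and sizes used in the testing above are not shown in the thread):

```ini
; Hypothetical fio job: write to the md raid1 device with embedded
; checksums, then read everything back and verify.
[global]
filename=/dev/md0
direct=1
ioengine=libaio
iodepth=16

[write-verify]
rw=randwrite
bs=4k
size=1g
verify=crc32c
do_verify=1
```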
-Paul
Dear Paul, dear Kuai,
On 22.02.24 14:04, Luse, Paul E wrote:
>> On Feb 22, 2024, at 2:08 AM, Yu Kuai <[email protected]> wrote:
>> On 2024/02/22 16:40, Paul Menzel wrote:
>>> Is there a change in performance with the current patch set? Is
>>> radi1 well enough covered by the test suite?
>>
>> Yes, there are no performance degradation, and mdadm tests passed.
>> And Paul Luse also ran fio mixed workload w/data integrity and it
>> passes.
>
> Kuai is correct, in my original perf improvement patch I included
> lots of results. For this set where we just refactored I checked
> performance to assure we didn't go downhill but didn't save the
> results as deltas were in the noise. After this series lands we will
> look at introducing performance improvements again and at that time
> results from a full performance sweep will be included.
>
> For data integrity, 1 and 2 disk mirrors were ran overnight w/fio and
> crcr32 check - no issues.
>
> To assure other code paths execute as they did before was a little
> trickier without a unit test framework but for those cases I did
> modify/un-modify the code several times to follow various code paths
> and assure they're working as expected (ie bad blocks, etc)
Thank you very much for the elaborate response.
In our infrastructure, we often notice things improve, but we sometimes
also have the “feeling” that things get worse. As IO is so complex, I
find it always helpful to note down exactly the test setup and the tests
that were run. So thank you for responding.
Kind regards,
Paul
Hi Kuai
Thanks for the effort!
On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
>
> read_balance:
> for_each_rdev
> if(next_seq_sect == this_sector || disk == 0)
typo error: s/disk/dist/g
> -> sequential reads
> best_disk = disk;
> if (...)
> choose_next_idle = 1
> continue;
>
> for_each_rdev
> -> iterate next rdev
> if (pending == 0)
> best_disk = disk;
> -> choose the next idle disk
> break;
>
> if (choose_next_idle)
> -> keep using this rdev if there are no other idle disk
> continue
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> remove the code:
>
> - /* If device is idle, use it */
> - if (pending == 0) {
> - best_disk = disk;
> - break;
> - }
>
> Hence choose next idle will never work now, fix this problem by
> following:
>
> 1) don't set best_disk in this case, read_balance() will choose the best
> disk after iterating all the disks;
> 2) add 'pending' so that other idle disk will be chosen;
> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> are rotational, this disk will still be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 21 ++++++++++++---------
> 1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c60ea58ae8c5..d0bc67e6d068 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> unsigned int min_pending;
> struct md_rdev *rdev;
> int choose_first;
> - int choose_next_idle;
>
> /*
> * Check if we can balance. We can balance on the whole
> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_pending_disk = -1;
> min_pending = UINT_MAX;
> best_good_sectors = 0;
> - choose_next_idle = 0;
> clear_bit(R1BIO_FailFast, &r1_bio->state);
>
> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> struct raid1_info *mirror = &conf->mirrors[disk];
>
> - best_disk = disk;
> /*
> * If buffered sequential IO size exceeds optimal
> * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> mirror->next_seq_sect > opt_iosize &&
> mirror->next_seq_sect - opt_iosize >=
> mirror->seq_start) {
> - choose_next_idle = 1;
> - continue;
> + /*
> + * Add 'pending' to avoid choosing this disk if
> + * there is other idle disk.
> + * Set 'dist' to 0, so that if there is no other
> + * idle disk and all disks are rotational, this
> + * disk will still be chosen.
> + */
> + pending++;
> + dist = 0;
There is a problem. If no disk is idle and there is a disk with dist=0
before the seq disk, it can't read from the seq disk. It will read from
the first disk with dist=0. Maybe we can just add back the code that was
removed in 2e52d449bcec?
Best Regards
Xiao
> + } else {
> + best_disk = disk;
> + break;
> }
> - break;
> }
>
> - if (choose_next_idle)
> - continue;
> -
> if (min_pending > pending) {
> min_pending = pending;
> best_pending_disk = disk;
> --
> 2.39.2
>
>
On Thu, Feb 22, 2024 at 4:05 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> The current api is_badblock() must pass in 'first_bad' and
> 'bad_sectors', however, many caller just want to know if there are
> badblocks or not, and these caller must define two local variable that
> will never be used.
>
> Add a new helper rdev_has_badblock() that will only return if there are
> badblocks or not, remove unnecessary local variables and replace
> is_badblock() with the new helper in many places.
>
> There are no functional changes, and the new helper will also be used
> later to refactor read_balance().
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/md.h | 10 ++++++++++
> drivers/md/raid1.c | 26 +++++++-------------------
> drivers/md/raid10.c | 45 ++++++++++++++-------------------------------
> drivers/md/raid5.c | 35 +++++++++++++----------------------
> 4 files changed, 44 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 8d881cc59799..a49ab04ab707 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -222,6 +222,16 @@ static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
> }
> return 0;
> }
> +
> +static inline int rdev_has_badblock(struct md_rdev *rdev, sector_t s,
> + int sectors)
> +{
> + sector_t first_bad;
> + int bad_sectors;
> +
> + return is_badblock(rdev, s, sectors, &first_bad, &bad_sectors);
> +}
> +
> extern int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
> int is_new);
> extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 286f8b16c7bd..a145fe48b9ce 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -498,9 +498,6 @@ static void raid1_end_write_request(struct bio *bio)
> * to user-side. So if something waits for IO, then it
> * will wait for the 'master' bio.
> */
> - sector_t first_bad;
> - int bad_sectors;
> -
> r1_bio->bios[mirror] = NULL;
> to_put = bio;
> /*
> @@ -516,8 +513,8 @@ static void raid1_end_write_request(struct bio *bio)
> set_bit(R1BIO_Uptodate, &r1_bio->state);
>
> /* Maybe we can clear some bad blocks. */
> - if (is_badblock(rdev, r1_bio->sector, r1_bio->sectors,
> - &first_bad, &bad_sectors) && !discard_error) {
> + if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) &&
> + !discard_error) {
> r1_bio->bios[mirror] = IO_MADE_GOOD;
> set_bit(R1BIO_MadeGood, &r1_bio->state);
> }
> @@ -1944,8 +1941,6 @@ static void end_sync_write(struct bio *bio)
> struct r1bio *r1_bio = get_resync_r1bio(bio);
> struct mddev *mddev = r1_bio->mddev;
> struct r1conf *conf = mddev->private;
> - sector_t first_bad;
> - int bad_sectors;
> struct md_rdev *rdev = conf->mirrors[find_bio_disk(r1_bio, bio)].rdev;
>
> if (!uptodate) {
> @@ -1955,14 +1950,11 @@ static void end_sync_write(struct bio *bio)
> set_bit(MD_RECOVERY_NEEDED, &
> mddev->recovery);
> set_bit(R1BIO_WriteError, &r1_bio->state);
> - } else if (is_badblock(rdev, r1_bio->sector, r1_bio->sectors,
> - &first_bad, &bad_sectors) &&
> - !is_badblock(conf->mirrors[r1_bio->read_disk].rdev,
> - r1_bio->sector,
> - r1_bio->sectors,
> - &first_bad, &bad_sectors)
> - )
> + } else if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) &&
> + !rdev_has_badblock(conf->mirrors[r1_bio->read_disk].rdev,
> + r1_bio->sector, r1_bio->sectors)) {
> set_bit(R1BIO_MadeGood, &r1_bio->state);
> + }
>
> put_sync_write_buf(r1_bio, uptodate);
> }
> @@ -2279,16 +2271,12 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
> s = PAGE_SIZE >> 9;
>
> do {
> - sector_t first_bad;
> - int bad_sectors;
> -
> rdev = conf->mirrors[d].rdev;
> if (rdev &&
> (test_bit(In_sync, &rdev->flags) ||
> (!test_bit(Faulty, &rdev->flags) &&
> rdev->recovery_offset >= sect + s)) &&
> - is_badblock(rdev, sect, s,
> - &first_bad, &bad_sectors) == 0) {
> + rdev_has_badblock(rdev, sect, s) == 0) {
> atomic_inc(&rdev->nr_pending);
> if (sync_page_io(rdev, sect, s<<9,
> conf->tmppage, REQ_OP_READ, false))
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 7412066ea22c..d5a7a621f0f0 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -518,11 +518,7 @@ static void raid10_end_write_request(struct bio *bio)
> * The 'master' represents the composite IO operation to
> * user-side. So if something waits for IO, then it will
> * wait for the 'master' bio.
> - */
> - sector_t first_bad;
> - int bad_sectors;
> -
> - /*
> + *
> * Do not set R10BIO_Uptodate if the current device is
> * rebuilding or Faulty. This is because we cannot use
> * such device for properly reading the data back (we could
> @@ -535,10 +531,9 @@ static void raid10_end_write_request(struct bio *bio)
> set_bit(R10BIO_Uptodate, &r10_bio->state);
>
> /* Maybe we can clear some bad blocks. */
> - if (is_badblock(rdev,
> - r10_bio->devs[slot].addr,
> - r10_bio->sectors,
> - &first_bad, &bad_sectors) && !discard_error) {
> + if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr,
> + r10_bio->sectors) &&
> + !discard_error) {
> bio_put(bio);
> if (repl)
> r10_bio->devs[slot].repl_bio = IO_MADE_GOOD;
> @@ -1330,10 +1325,7 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio)
> }
>
> if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
> - sector_t first_bad;
> sector_t dev_sector = r10_bio->devs[i].addr;
> - int bad_sectors;
> - int is_bad;
>
> /*
> * Discard request doesn't care the write result
> @@ -1342,9 +1334,8 @@ static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio)
> if (!r10_bio->sectors)
> continue;
>
> - is_bad = is_badblock(rdev, dev_sector, r10_bio->sectors,
> - &first_bad, &bad_sectors);
> - if (is_bad < 0) {
> + if (rdev_has_badblock(rdev, dev_sector,
> + r10_bio->sectors) < 0) {
> /*
> * Mustn't write here until the bad block
> * is acknowledged
> @@ -2290,8 +2281,6 @@ static void end_sync_write(struct bio *bio)
> struct mddev *mddev = r10_bio->mddev;
> struct r10conf *conf = mddev->private;
> int d;
> - sector_t first_bad;
> - int bad_sectors;
> int slot;
> int repl;
> struct md_rdev *rdev = NULL;
> @@ -2312,11 +2301,10 @@ static void end_sync_write(struct bio *bio)
> &rdev->mddev->recovery);
> set_bit(R10BIO_WriteError, &r10_bio->state);
> }
> - } else if (is_badblock(rdev,
> - r10_bio->devs[slot].addr,
> - r10_bio->sectors,
> - &first_bad, &bad_sectors))
> + } else if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr,
> + r10_bio->sectors)) {
> set_bit(R10BIO_MadeGood, &r10_bio->state);
> + }
>
> rdev_dec_pending(rdev, mddev);
>
> @@ -2597,11 +2585,8 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio)
> static int r10_sync_page_io(struct md_rdev *rdev, sector_t sector,
> int sectors, struct page *page, enum req_op op)
> {
> - sector_t first_bad;
> - int bad_sectors;
> -
> - if (is_badblock(rdev, sector, sectors, &first_bad, &bad_sectors)
> - && (op == REQ_OP_READ || test_bit(WriteErrorSeen, &rdev->flags)))
> + if (rdev_has_badblock(rdev, sector, sectors) &&
> + (op == REQ_OP_READ || test_bit(WriteErrorSeen, &rdev->flags)))
> return -1;
> if (sync_page_io(rdev, sector, sectors << 9, page, op, false))
> /* success */
> @@ -2658,16 +2643,14 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
> s = PAGE_SIZE >> 9;
>
> do {
> - sector_t first_bad;
> - int bad_sectors;
> -
> d = r10_bio->devs[sl].devnum;
> rdev = conf->mirrors[d].rdev;
> if (rdev &&
> test_bit(In_sync, &rdev->flags) &&
> !test_bit(Faulty, &rdev->flags) &&
> - is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
> - &first_bad, &bad_sectors) == 0) {
> + rdev_has_badblock(rdev,
> + r10_bio->devs[sl].addr + sect,
> + s) == 0) {
> atomic_inc(&rdev->nr_pending);
> success = sync_page_io(rdev,
> r10_bio->devs[sl].addr +
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 14f2cf75abbd..9241e95ef55c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1210,10 +1210,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
> */
> while (op_is_write(op) && rdev &&
> test_bit(WriteErrorSeen, &rdev->flags)) {
> - sector_t first_bad;
> - int bad_sectors;
> - int bad = is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf),
> - &first_bad, &bad_sectors);
> + int bad = rdev_has_badblock(rdev, sh->sector,
> + RAID5_STRIPE_SECTORS(conf));
> if (!bad)
> break;
>
> @@ -2855,8 +2853,6 @@ static void raid5_end_write_request(struct bio *bi)
> struct r5conf *conf = sh->raid_conf;
> int disks = sh->disks, i;
> struct md_rdev *rdev;
> - sector_t first_bad;
> - int bad_sectors;
> int replacement = 0;
>
> for (i = 0 ; i < disks; i++) {
> @@ -2888,9 +2884,8 @@ static void raid5_end_write_request(struct bio *bi)
> if (replacement) {
> if (bi->bi_status)
> md_error(conf->mddev, rdev);
> - else if (is_badblock(rdev, sh->sector,
> - RAID5_STRIPE_SECTORS(conf),
> - &first_bad, &bad_sectors))
> + else if (rdev_has_badblock(rdev, sh->sector,
> + RAID5_STRIPE_SECTORS(conf)))
> set_bit(R5_MadeGoodRepl, &sh->dev[i].flags);
> } else {
> if (bi->bi_status) {
> @@ -2900,9 +2895,8 @@ static void raid5_end_write_request(struct bio *bi)
> if (!test_and_set_bit(WantReplacement, &rdev->flags))
> set_bit(MD_RECOVERY_NEEDED,
> &rdev->mddev->recovery);
> - } else if (is_badblock(rdev, sh->sector,
> - RAID5_STRIPE_SECTORS(conf),
> - &first_bad, &bad_sectors)) {
> + } else if (rdev_has_badblock(rdev, sh->sector,
> + RAID5_STRIPE_SECTORS(conf))) {
> set_bit(R5_MadeGood, &sh->dev[i].flags);
> if (test_bit(R5_ReadError, &sh->dev[i].flags))
> /* That was a successful write so make
> @@ -4674,8 +4668,6 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> /* Now to look around and see what can be done */
> for (i=disks; i--; ) {
> struct md_rdev *rdev;
> - sector_t first_bad;
> - int bad_sectors;
> int is_bad = 0;
>
> dev = &sh->dev[i];
> @@ -4719,8 +4711,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> rdev = conf->disks[i].replacement;
> if (rdev && !test_bit(Faulty, &rdev->flags) &&
> rdev->recovery_offset >= sh->sector + RAID5_STRIPE_SECTORS(conf) &&
> - !is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf),
> - &first_bad, &bad_sectors))
> + !rdev_has_badblock(rdev, sh->sector,
> + RAID5_STRIPE_SECTORS(conf)))
> set_bit(R5_ReadRepl, &dev->flags);
> else {
> if (rdev && !test_bit(Faulty, &rdev->flags))
> @@ -4733,8 +4725,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> if (rdev && test_bit(Faulty, &rdev->flags))
> rdev = NULL;
> if (rdev) {
> - is_bad = is_badblock(rdev, sh->sector, RAID5_STRIPE_SECTORS(conf),
> - &first_bad, &bad_sectors);
> + is_bad = rdev_has_badblock(rdev, sh->sector,
> + RAID5_STRIPE_SECTORS(conf));
> if (s->blocked_rdev == NULL
> && (test_bit(Blocked, &rdev->flags)
> || is_bad < 0)) {
> @@ -5463,8 +5455,8 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
> struct r5conf *conf = mddev->private;
> struct bio *align_bio;
> struct md_rdev *rdev;
> - sector_t sector, end_sector, first_bad;
> - int bad_sectors, dd_idx;
> + sector_t sector, end_sector;
> + int dd_idx;
> bool did_inc;
>
> if (!in_chunk_boundary(mddev, raid_bio)) {
> @@ -5493,8 +5485,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>
> atomic_inc(&rdev->nr_pending);
>
> - if (is_badblock(rdev, sector, bio_sectors(raid_bio), &first_bad,
> - &bad_sectors)) {
> + if (rdev_has_badblock(rdev, sector, bio_sectors(raid_bio))) {
> rdev_dec_pending(rdev, mddev);
> return 0;
> }
> --
> 2.39.2
>
>
This patch looks good to me.
Reviewed-by: Xiao Ni <[email protected]>
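For readers following along, every hunk in the diff above is the same mechanical substitution: callers that ignored is_badblock()'s out-parameters now call a wrapper. A standalone sketch of that pattern (toy types made up for the sketch, not the kernel's):

```c
#include <assert.h>

/* Toy model: is_badblock() reports the first bad range through
 * out-parameters, which most callers in the diff never used, so
 * rdev_has_badblock() wraps it with throwaway locals. */
typedef long long sector_t;

struct rdev_model {
	sector_t bad_start;	/* a single bad range is enough here */
	sector_t bad_len;
};

static int is_badblock(struct rdev_model *rdev, sector_t s, int sectors,
		       sector_t *first_bad, int *bad_sectors)
{
	/* report overlap of [s, s + sectors) with the bad range */
	if (s + sectors <= rdev->bad_start ||
	    s >= rdev->bad_start + rdev->bad_len)
		return 0;
	*first_bad = rdev->bad_start;
	*bad_sectors = (int)rdev->bad_len;
	return 1;
}

/* The new helper: same answer, no out-parameter boilerplate at call sites. */
static int rdev_has_badblock(struct rdev_model *rdev, sector_t s, int sectors)
{
	sector_t first_bad;
	int bad_sectors;

	return is_badblock(rdev, s, sectors, &first_bad, &bad_sectors);
}
```

Call sites that only need a yes/no, like the raid5 ones above, then shrink to a single expression.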
Hi,
On 2024/02/26 16:55, Xiao Ni wrote:
> Hi Kuai
>
> Thanks for the effort!
>
> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>>
>> From: Yu Kuai <[email protected]>
>>
>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
>> the case choose next idle in read_balance():
>>
>> read_balance:
>> for_each_rdev
>> if(next_seq_sect == this_sector || disk == 0)
>
> typo error: s/disk/dist/g
>
>> -> sequential reads
>> best_disk = disk;
>> if (...)
>> choose_next_idle = 1
>> continue;
>>
>> for_each_rdev
>> -> iterate next rdev
>> if (pending == 0)
>> best_disk = disk;
>> -> choose the next idle disk
>> break;
>>
>> if (choose_next_idle)
>> -> keep using this rdev if there are no other idle disk
>> continue
>>
>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>> remove the code:
>>
>> - /* If device is idle, use it */
>> - if (pending == 0) {
>> - best_disk = disk;
>> - break;
>> - }
>>
>> Hence choose next idle will never work now, fix this problem by
>> following:
>>
>> 1) don't set best_disk in this case, read_balance() will choose the best
>> disk after iterating all the disks;
>> 2) add 'pending' so that other idle disk will be chosen;
>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>> are rotational, this disk will still be chosen;
>>
>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>> Co-developed-by: Paul Luse <[email protected]>
>> Signed-off-by: Paul Luse <[email protected]>
>> Signed-off-by: Yu Kuai <[email protected]>
>> ---
>> drivers/md/raid1.c | 21 ++++++++++++---------
>> 1 file changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index c60ea58ae8c5..d0bc67e6d068 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> unsigned int min_pending;
>> struct md_rdev *rdev;
>> int choose_first;
>> - int choose_next_idle;
>>
>> /*
>> * Check if we can balance. We can balance on the whole
>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> best_pending_disk = -1;
>> min_pending = UINT_MAX;
>> best_good_sectors = 0;
>> - choose_next_idle = 0;
>> clear_bit(R1BIO_FailFast, &r1_bio->state);
>>
>> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>> struct raid1_info *mirror = &conf->mirrors[disk];
>>
>> - best_disk = disk;
>> /*
>> * If buffered sequential IO size exceeds optimal
>> * iosize, check if there is idle disk. If yes, choose
>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> mirror->next_seq_sect > opt_iosize &&
>> mirror->next_seq_sect - opt_iosize >=
>> mirror->seq_start) {
>> - choose_next_idle = 1;
>> - continue;
>> + /*
>> + * Add 'pending' to avoid choosing this disk if
>> + * there is other idle disk.
>> + * Set 'dist' to 0, so that if there is no other
>> + * idle disk and all disks are rotational, this
>> + * disk will still be chosen.
>> + */
>> + pending++;
>> + dist = 0;
>
> There is a problem. If all disks are not idle and there is a disk with
> dist=0 before the seq disk, it can't read from the seq disk. It will
> read from the first disk with dist=0. Maybe we can only add the codes
> which are removed from 2e52d449bcec?
If there is a disk with disk=0, then best_dist_disk will be updated to
the disk, and best_dist will be updated to 0 already:
// in each iteration
if (dist < best_dist) {
best_dist = dist;
btest_disk_disk = disk;
}
In this case, best_dist_disk will be set to the first disk with dist=0, and
in the end that disk will be chosen:
if (best_disk == -1) {
if (has_nonrot_disk || min_pending == 0)
best_disk = best_pending_disk;
else
best_disk = best_dist_disk;
-> the first disk with dist=0;
}
So the problem you are concerned about should not exist.
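The explanation above can be condensed into a runnable toy model of the selection tail of read_balance() (made-up structures; the real function tracks much more state, but the tie-breaking is the same):

```c
#include <assert.h>
#include <limits.h>

/* Each disk contributes (dist, pending); after this patch the sequential
 * disk that exceeded opt_iosize is handled as pending + 1 and dist = 0
 * instead of being picked immediately. */
struct disk_state {
	long long dist;
	unsigned int pending;
	int is_seq;	/* sequential reader that exceeded opt_iosize */
};

static int pick_disk(struct disk_state *d, int ndisks, int has_nonrot)
{
	int best_dist_disk = -1, best_pending_disk = -1;
	long long best_dist = LLONG_MAX;
	unsigned int min_pending = UINT_MAX;
	int i;

	for (i = 0; i < ndisks; i++) {
		long long dist = d[i].dist;
		unsigned int pending = d[i].pending;

		if (d[i].is_seq) {	/* the fixed branch */
			pending++;
			dist = 0;
		}
		if (pending < min_pending) {
			min_pending = pending;
			best_pending_disk = i;
		}
		if (dist < best_dist) {
			best_dist = dist;
			best_dist_disk = i;
		}
	}
	/* the best_disk == -1 fallback quoted above */
	return (has_nonrot || min_pending == 0) ? best_pending_disk
						: best_dist_disk;
}
```

With all disks rotational and none idle, the first disk with dist 0 wins, as described; if any disk is idle, it is preferred over the sequential one.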
Thanks,
Kuai
>
> Best Regards
> Xiao
>
>> + } else {
>> + best_disk = disk;
>> + break;
>> }
>> - break;
>> }
>>
>> - if (choose_next_idle)
>> - continue;
>> -
>> if (min_pending > pending) {
>> min_pending = pending;
>> best_pending_disk = disk;
>> --
>> 2.39.2
>>
>>
>
> .
>
On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2024/02/26 16:55, Xiao Ni wrote:
> > Hi Kuai
> >
> > Thanks for the effort!
> >
> > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
> >>
> >> From: Yu Kuai <[email protected]>
> >>
> >> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >> the case choose next idle in read_balance():
> >>
> >> read_balance:
> >> for_each_rdev
> >> if(next_seq_sect == this_sector || disk == 0)
> >
> > typo error: s/disk/dist/g
> >
> >> -> sequential reads
> >> best_disk = disk;
> >> if (...)
> >> choose_next_idle = 1
> >> continue;
> >>
> >> for_each_rdev
> >> -> iterate next rdev
> >> if (pending == 0)
> >> best_disk = disk;
> >> -> choose the next idle disk
> >> break;
> >>
> >> if (choose_next_idle)
> >> -> keep using this rdev if there are no other idle disk
> >> continue
> >>
> >> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> remove the code:
> >>
> >> - /* If device is idle, use it */
> >> - if (pending == 0) {
> >> - best_disk = disk;
> >> - break;
> >> - }
> >>
> >> Hence choose next idle will never work now, fix this problem by
> >> following:
> >>
> >> 1) don't set best_disk in this case, read_balance() will choose the best
> >> disk after iterating all the disks;
> >> 2) add 'pending' so that other idle disk will be chosen;
> >> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >> are rotational, this disk will still be chosen;
> >>
> >> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> Co-developed-by: Paul Luse <[email protected]>
> >> Signed-off-by: Paul Luse <[email protected]>
> >> Signed-off-by: Yu Kuai <[email protected]>
> >> ---
> >> drivers/md/raid1.c | 21 ++++++++++++---------
> >> 1 file changed, 12 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c60ea58ae8c5..d0bc67e6d068 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> unsigned int min_pending;
> >> struct md_rdev *rdev;
> >> int choose_first;
> >> - int choose_next_idle;
> >>
> >> /*
> >> * Check if we can balance. We can balance on the whole
> >> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> best_pending_disk = -1;
> >> min_pending = UINT_MAX;
> >> best_good_sectors = 0;
> >> - choose_next_idle = 0;
> >> clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>
> >> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >> struct raid1_info *mirror = &conf->mirrors[disk];
> >>
> >> - best_disk = disk;
> >> /*
> >> * If buffered sequential IO size exceeds optimal
> >> * iosize, check if there is idle disk. If yes, choose
> >> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> mirror->next_seq_sect > opt_iosize &&
> >> mirror->next_seq_sect - opt_iosize >=
> >> mirror->seq_start) {
> >> - choose_next_idle = 1;
> >> - continue;
> >> + /*
> >> + * Add 'pending' to avoid choosing this disk if
> >> + * there is other idle disk.
> >> + * Set 'dist' to 0, so that if there is no other
> >> + * idle disk and all disks are rotational, this
> >> + * disk will still be chosen.
> >> + */
> >> + pending++;
> >> + dist = 0;
> >
> > There is a problem. If all disks are not idle and there is a disk with
> > dist=0 before the seq disk, it can't read from the seq disk. It will
> > read from the first disk with dist=0. Maybe we can only add the codes
> > which are removed from 2e52d449bcec?
>
> If there is a disk with disk=0, then best_dist_disk will be updated to
> the disk, and best_dist will be updated to 0 already:
>
> // in each iteration
> if (dist < best_dist) {
> best_dist = dist;
> btest_disk_disk = disk;
> }
>
> In this case, best_dist will be set to the first disk with dist=0, and
> at last, the disk will be chosen:
>
> if (best_disk == -1) {
> if (has_nonrot_disk || min_pending == 0)
> best_disk = best_pending_disk;
> else
> best_disk = best_dist_disk;
> -> the first disk with dist=0;
> }
>
> So, the problem that you concerned should not exist.
Hi Kuai
Thanks for the explanation. You're right, it chooses the first disk
that has dist==0. In the above, you made the same typo, disk=0, as in
the comment. I guess you meant dist=0, right? Besides this, this patch
looks good to me.
Best Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Best Regards
> > Xiao
> >
> >> + } else {
> >> + best_disk = disk;
> >> + break;
> >> }
> >> - break;
> >> }
> >>
> >> - if (choose_next_idle)
> >> - continue;
> >> -
> >> if (min_pending > pending) {
> >> min_pending = pending;
> >> best_pending_disk = disk;
> >> --
> >> 2.39.2
> >>
> >>
> >
> > .
> >
>
Hi,
On 2024/02/26 17:24, Xiao Ni wrote:
> On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
> >> On 2024/02/26 16:55, Xiao Ni wrote:
>>> Hi Kuai
>>>
>>> Thanks for the effort!
>>>
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>>>>
>>>> From: Yu Kuai <[email protected]>
>>>>
>>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
>>>> the case choose next idle in read_balance():
>>>>
>>>> read_balance:
>>>> for_each_rdev
>>>> if(next_seq_sect == this_sector || disk == 0)
>>>
>>> typo error: s/disk/dist/g
>>>
>>>> -> sequential reads
>>>> best_disk = disk;
>>>> if (...)
>>>> choose_next_idle = 1
>>>> continue;
>>>>
>>>> for_each_rdev
>>>> -> iterate next rdev
>>>> if (pending == 0)
>>>> best_disk = disk;
>>>> -> choose the next idle disk
>>>> break;
>>>>
>>>> if (choose_next_idle)
>>>> -> keep using this rdev if there are no other idle disk
>>>> continue
>>>>
>>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> remove the code:
>>>>
>>>> - /* If device is idle, use it */
>>>> - if (pending == 0) {
>>>> - best_disk = disk;
>>>> - break;
>>>> - }
>>>>
>>>> Hence choose next idle will never work now, fix this problem by
>>>> following:
>>>>
>>>> 1) don't set best_disk in this case, read_balance() will choose the best
>>>> disk after iterating all the disks;
>>>> 2) add 'pending' so that other idle disk will be chosen;
>>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>>> are rotational, this disk will still be chosen;
>>>>
>>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> Co-developed-by: Paul Luse <[email protected]>
>>>> Signed-off-by: Paul Luse <[email protected]>
>>>> Signed-off-by: Yu Kuai <[email protected]>
>>>> ---
>>>> drivers/md/raid1.c | 21 ++++++++++++---------
>>>> 1 file changed, 12 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index c60ea58ae8c5..d0bc67e6d068 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> unsigned int min_pending;
>>>> struct md_rdev *rdev;
>>>> int choose_first;
>>>> - int choose_next_idle;
>>>>
>>>> /*
>>>> * Check if we can balance. We can balance on the whole
>>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> best_pending_disk = -1;
>>>> min_pending = UINT_MAX;
>>>> best_good_sectors = 0;
>>>> - choose_next_idle = 0;
>>>> clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>>
>>>> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>>> struct raid1_info *mirror = &conf->mirrors[disk];
>>>>
>>>> - best_disk = disk;
>>>> /*
>>>> * If buffered sequential IO size exceeds optimal
>>>> * iosize, check if there is idle disk. If yes, choose
>>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> mirror->next_seq_sect > opt_iosize &&
>>>> mirror->next_seq_sect - opt_iosize >=
>>>> mirror->seq_start) {
>>>> - choose_next_idle = 1;
>>>> - continue;
>>>> + /*
>>>> + * Add 'pending' to avoid choosing this disk if
>>>> + * there is other idle disk.
>>>> + * Set 'dist' to 0, so that if there is no other
>>>> + * idle disk and all disks are rotational, this
>>>> + * disk will still be chosen.
>>>> + */
>>>> + pending++;
>>>> + dist = 0;
>>>
>>> There is a problem. If all disks are not idle and there is a disk with
>>> dist=0 before the seq disk, it can't read from the seq disk. It will
>>> read from the first disk with dist=0. Maybe we can only add the codes
>>> which are removed from 2e52d449bcec?
>>
>> If there is a disk with disk=0, then best_dist_disk will be updated to
>> the disk, and best_dist will be updated to 0 already:
>>
>> // in each iteration
>> if (dist < best_dist) {
>> best_dist = dist;
>> btest_disk_disk = disk;
>> }
>>
>> In this case, best_dist will be set to the first disk with dist=0, and
>> at last, the disk will be chosen:
>>
>> if (best_disk == -1) {
>> if (has_nonrot_disk || min_pending == 0)
>> best_disk = best_pending_disk;
>> else
>> best_disk = best_dist_disk;
>> -> the first disk with dist=0;
>> }
>>
>> So, the problem that you concerned should not exist.
>
> Hi Kuai
>
> Thanks for the explanation. You're right. It chooses the first disk
> which has dist==0. In the above, you made the same typo error disk=0
> as the comment. I guess you want to use dist=0, right? Beside this,
> this patch is good to me.
Yes, and Paul changed 'best_dist' to 'closest_dist', and 'best_dist_disk'
(mistyped above as 'btest_disk_disk') to 'closest_dist_disk' in the last
patch, to avoid typos like this. :)
Thanks,
Kuai
>
> Best Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>>
>>> Best Regards
>>> Xiao
>>>
>>>> + } else {
>>>> + best_disk = disk;
>>>> + break;
>>>> }
>>>> - break;
>>>> }
>>>>
>>>> - if (choose_next_idle)
>>>> - continue;
>>>> -
>>>> if (min_pending > pending) {
>>>> min_pending = pending;
>>>> best_pending_disk = disk;
>>>> --
>>>> 2.39.2
>>>>
>>>>
>>>
>>> .
>>>
>>
>
> .
>
Hi Kuai
I added some logs to check, and add_bound_rdev() is not called when
creating a raid device. Maybe move rdev_update_nonrot() to
bind_rdev_to_array()?
Regards
Xiao
On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> For raid1/raid10, each read will iterate all the rdevs from conf and
> check if any rdev is non-rotational, then choose rdev with minimal IO
> inflight if so, or rdev with closest distance otherwise.
>
> Disk nonrot info can be changed through sysfs entry:
>
> /sys/block/[disk_name]/queue/rotational
>
> However, consider that this should only be used for testing, and user
> really shouldn't do this in real life. Record the number of non-rotational
> disks in mddev, to avoid checking each rdev in IO fast path and simplify
> read_balance() a little bit.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/md.c | 28 ++++++++++++++++++++++++++--
> drivers/md/md.h | 2 ++
> drivers/md/raid1.c | 9 ++-------
> drivers/md/raid10.c | 8 ++------
> 4 files changed, 32 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e2a5f513dbb7..9e671eec9309 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -146,6 +146,24 @@ static inline int speed_max(struct mddev *mddev)
> mddev->sync_speed_max : sysctl_speed_limit_max;
> }
>
> +static void rdev_update_nonrot(struct md_rdev *rdev)
> +{
> + if (!bdev_nonrot(rdev->bdev))
> + return;
> +
> + set_bit(Nonrot, &rdev->flags);
> + WRITE_ONCE(rdev->mddev->nonrot_disks, rdev->mddev->nonrot_disks + 1);
> +}
> +
> +static void rdev_clear_nonrot(struct md_rdev *rdev)
> +{
> + if (!test_bit(Nonrot, &rdev->flags))
> + return;
> +
> + clear_bit(Nonrot, &rdev->flags);
> + WRITE_ONCE(rdev->mddev->nonrot_disks, rdev->mddev->nonrot_disks - 1);
> +}
> +
> static void rdev_uninit_serial(struct md_rdev *rdev)
> {
> if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
> @@ -2922,6 +2940,8 @@ static int add_bound_rdev(struct md_rdev *rdev)
> md_kick_rdev_from_array(rdev);
> return err;
> }
> +
> + rdev_update_nonrot(rdev);
> }
> sysfs_notify_dirent_safe(rdev->sysfs_state);
>
> @@ -3271,8 +3291,10 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len)
> if (err) {
> rdev->raid_disk = -1;
> return err;
> - } else
> - sysfs_notify_dirent_safe(rdev->sysfs_state);
> + }
> +
> + rdev_update_nonrot(rdev);
> + sysfs_notify_dirent_safe(rdev->sysfs_state);
> /* failure here is OK */;
> sysfs_link_rdev(rdev->mddev, rdev);
> /* don't wakeup anyone, leave that to userspace. */
> @@ -9266,6 +9288,7 @@ static int remove_and_add_spares(struct mddev *mddev,
> rdev_for_each(rdev, mddev) {
> if ((this == NULL || rdev == this) && rdev_removeable(rdev) &&
> !mddev->pers->hot_remove_disk(mddev, rdev)) {
> + rdev_clear_nonrot(rdev);
> sysfs_unlink_rdev(mddev, rdev);
> rdev->saved_raid_disk = rdev->raid_disk;
> rdev->raid_disk = -1;
> @@ -9289,6 +9312,7 @@ static int remove_and_add_spares(struct mddev *mddev,
> if (!test_bit(Journal, &rdev->flags))
> rdev->recovery_offset = 0;
> if (mddev->pers->hot_add_disk(mddev, rdev) == 0) {
> + rdev_update_nonrot(rdev);
> /* failure here is OK */
> sysfs_link_rdev(mddev, rdev);
> if (!test_bit(Journal, &rdev->flags))
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index a49ab04ab707..54aa951f2bba 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -207,6 +207,7 @@ enum flag_bits {
> * check if there is collision between raid1
> * serial bios.
> */
> + Nonrot, /* non-rotational device (SSD) */
> };
>
> static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
> @@ -312,6 +313,7 @@ struct mddev {
> unsigned long flags;
> unsigned long sb_flags;
>
> + int nonrot_disks;
> int suspended;
> struct mutex suspend_mutex;
> struct percpu_ref active_io;
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index a145fe48b9ce..c60ea58ae8c5 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -599,7 +599,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> int sectors;
> int best_good_sectors;
> int best_disk, best_dist_disk, best_pending_disk;
> - int has_nonrot_disk;
> int disk;
> sector_t best_dist;
> unsigned int min_pending;
> @@ -620,7 +619,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_pending_disk = -1;
> min_pending = UINT_MAX;
> best_good_sectors = 0;
> - has_nonrot_disk = 0;
> choose_next_idle = 0;
> clear_bit(R1BIO_FailFast, &r1_bio->state);
>
> @@ -637,7 +635,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> sector_t first_bad;
> int bad_sectors;
> unsigned int pending;
> - bool nonrot;
>
> rdev = conf->mirrors[disk].rdev;
> if (r1_bio->bios[disk] == IO_BLOCKED
> @@ -703,8 +700,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> /* At least two disks to choose from so failfast is OK */
> set_bit(R1BIO_FailFast, &r1_bio->state);
>
> - nonrot = bdev_nonrot(rdev->bdev);
> - has_nonrot_disk |= nonrot;
> pending = atomic_read(&rdev->nr_pending);
> dist = abs(this_sector - conf->mirrors[disk].head_position);
> if (choose_first) {
> @@ -731,7 +726,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * small, but not a big deal since when the second disk
> * starts IO, the first disk is likely still busy.
> */
> - if (nonrot && opt_iosize > 0 &&
> + if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
> mirror->seq_start != MaxSector &&
> mirror->next_seq_sect > opt_iosize &&
> mirror->next_seq_sect - opt_iosize >=
> @@ -763,7 +758,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * mixed ratation/non-rotational disks depending on workload.
> */
> if (best_disk == -1) {
> - if (has_nonrot_disk || min_pending == 0)
> + if (conf->mddev->nonrot_disks || min_pending == 0)
> best_disk = best_pending_disk;
> else
> best_disk = best_dist_disk;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index d5a7a621f0f0..1f6693e40e12 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -735,7 +735,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
> struct md_rdev *best_dist_rdev, *best_pending_rdev, *rdev = NULL;
> int do_balance;
> int best_dist_slot, best_pending_slot;
> - bool has_nonrot_disk = false;
> unsigned int min_pending;
> struct geom *geo = &conf->geo;
>
> @@ -766,7 +765,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
> int bad_sectors;
> sector_t dev_sector;
> unsigned int pending;
> - bool nonrot;
>
> if (r10_bio->devs[slot].bio == IO_BLOCKED)
> continue;
> @@ -818,10 +816,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
> if (!do_balance)
> break;
>
> - nonrot = bdev_nonrot(rdev->bdev);
> - has_nonrot_disk |= nonrot;
> pending = atomic_read(&rdev->nr_pending);
> - if (min_pending > pending && nonrot) {
> + if (min_pending > pending && test_bit(Nonrot, &rdev->flags)) {
> min_pending = pending;
> best_pending_slot = slot;
> best_pending_rdev = rdev;
> @@ -851,7 +847,7 @@ static struct md_rdev *read_balance(struct r10conf *conf,
> }
> }
> if (slot >= conf->copies) {
> - if (has_nonrot_disk) {
> + if (conf->mddev->nonrot_disks) {
> slot = best_pending_slot;
> rdev = best_pending_rdev;
> } else {
> --
> 2.39.2
>
>
On Mon, Feb 26, 2024 at 5:40 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2024/02/26 17:24, Xiao Ni wrote:
> > On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> On 2024/02/26 16:55, Xiao Ni wrote:
> >>> Hi Kuai
> >>>
> >>> Thanks for the effort!
> >>>
> >>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
> >>>>
> >>>> From: Yu Kuai <[email protected]>
> >>>>
> >>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >>>> the case choose next idle in read_balance():
> >>>>
> >>>> read_balance:
> >>>> for_each_rdev
> >>>> if(next_seq_sect == this_sector || disk == 0)
> >>>
> >>> typo error: s/disk/dist/g
> >>>
> >>>> -> sequential reads
> >>>> best_disk = disk;
> >>>> if (...)
> >>>> choose_next_idle = 1
> >>>> continue;
> >>>>
> >>>> for_each_rdev
> >>>> -> iterate next rdev
> >>>> if (pending == 0)
> >>>> best_disk = disk;
> >>>> -> choose the next idle disk
> >>>> break;
> >>>>
> >>>> if (choose_next_idle)
> >>>> -> keep using this rdev if there are no other idle disk
> >>>> continue
> >>>>
> >>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >>>> remove the code:
> >>>>
> >>>> - /* If device is idle, use it */
> >>>> - if (pending == 0) {
> >>>> - best_disk = disk;
> >>>> - break;
> >>>> - }
> >>>>
> >>>> Hence choose next idle will never work now, fix this problem by
> >>>> following:
> >>>>
> >>>> 1) don't set best_disk in this case, read_balance() will choose the best
> >>>> disk after iterating all the disks;
> >>>> 2) add 'pending' so that other idle disk will be chosen;
> >>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >>>> are rotational, this disk will still be chosen;
> >>>>
> >>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >>>> Co-developed-by: Paul Luse <[email protected]>
> >>>> Signed-off-by: Paul Luse <[email protected]>
> >>>> Signed-off-by: Yu Kuai <[email protected]>
> >>>> ---
> >>>> drivers/md/raid1.c | 21 ++++++++++++---------
> >>>> 1 file changed, 12 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >>>> index c60ea58ae8c5..d0bc67e6d068 100644
> >>>> --- a/drivers/md/raid1.c
> >>>> +++ b/drivers/md/raid1.c
> >>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> unsigned int min_pending;
> >>>> struct md_rdev *rdev;
> >>>> int choose_first;
> >>>> - int choose_next_idle;
> >>>>
> >>>> /*
> >>>> * Check if we can balance. We can balance on the whole
> >>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> best_pending_disk = -1;
> >>>> min_pending = UINT_MAX;
> >>>> best_good_sectors = 0;
> >>>> - choose_next_idle = 0;
> >>>> clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>>>
> >>>> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >>>> struct raid1_info *mirror = &conf->mirrors[disk];
> >>>>
> >>>> - best_disk = disk;
> >>>> /*
> >>>> * If buffered sequential IO size exceeds optimal
> >>>> * iosize, check if there is idle disk. If yes, choose
> >>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> mirror->next_seq_sect > opt_iosize &&
> >>>> mirror->next_seq_sect - opt_iosize >=
> >>>> mirror->seq_start) {
> >>>> - choose_next_idle = 1;
> >>>> - continue;
> >>>> + /*
> >>>> + * Add 'pending' to avoid choosing this disk if
> >>>> + * there is other idle disk.
> >>>> + * Set 'dist' to 0, so that if there is no other
> >>>> + * idle disk and all disks are rotational, this
> >>>> + * disk will still be chosen.
> >>>> + */
> >>>> + pending++;
> >>>> + dist = 0;
> >>>
> >>> There is a problem. If all disks are not idle and there is a disk with
> >>> dist=0 before the seq disk, it can't read from the seq disk. It will
> >>> read from the first disk with dist=0. Maybe we can only add the codes
> >>> which are removed from 2e52d449bcec?
> >>
> >> If there is a disk with disk=0, then best_dist_disk will be updated to
> >> the disk, and best_dist will be updated to 0 already:
> >>
> >> // in each iteration
> >> if (dist < best_dist) {
> >> best_dist = dist;
> >> btest_disk_disk = disk;
> >> }
> >>
> >> In this case, best_dist will be set to the first disk with dist=0, and
> >> at last, the disk will be chosen:
> >>
> >> if (best_disk == -1) {
> >> if (has_nonrot_disk || min_pending == 0)
> >> best_disk = best_pending_disk;
> >> else
> >> best_disk = best_dist_disk;
> >> -> the first disk with dist=0;
> >> }
> >>
> >> So, the problem that you concerned should not exist.
> >
> > Hi Kuai
> >
> > Thanks for the explanation. You're right. It chooses the first disk
> > which has dist==0. In the above, you made the same typo error disk=0
> > as the comment. I guess you want to use dist=0, right? Beside this,
> > this patch is good to me.
>
> Yes, and Paul change the name 'best_dist' to 'closest_dist_disk',
> and 'btest_disk_disk' to 'closest_dist' in the last patch to avoid typo
> like this. :)
Ah, thanks :) I hadn't gotten to that patch yet.
Regards
Xiao
>
> Thanks,
> Kuai
>
>
> >
> > Best Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>> Best Regards
> >>> Xiao
> >>>
> >>>> + } else {
> >>>> + best_disk = disk;
> >>>> + break;
> >>>> }
> >>>> - break;
> >>>> }
> >>>>
> >>>> - if (choose_next_idle)
> >>>> - continue;
> >>>> -
> >>>> if (min_pending > pending) {
> >>>> min_pending = pending;
> >>>> best_pending_disk = disk;
> >>>> --
> >>>> 2.39.2
> >>>>
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>
Hi,
On 2024/02/26 21:12, Xiao Ni wrote:
> Hi Kuai
>
> I added some logs to check and add_bound_rdev can't be called when
> creating raid device. Maybe move rdev_update_nonrot to
> bind_rdev_to_array?
bind_rdev_to_array() is used to add a new rdev to the array, and then
'pers->hot_add_disk' is used to add the rdev to conf. For new spares,
bind_rdev_to_array() will be called while 'pers->hot_add_disk' won't be.
Hence, in this patch, rdev_update_nonrot() is called where
'pers->hot_add_disk' succeeds.
Perhaps it's better to move the related code to raid1/raid10_add_disk()
and keep the new counter in the raid1/raid10 conf?
Thanks,
Kuai
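For clarity, the bookkeeping being discussed amounts to a conditional flag plus a counter. A toy sketch with plain fields (the real code uses atomic bitops on rdev->flags and WRITE_ONCE on mddev->nonrot_disks):

```c
#include <assert.h>

/* Toy model: the Nonrot flag mirrors bdev_nonrot() at add time, and
 * mddev->nonrot_disks counts how many members have it set, so the read
 * fast path only has to test one integer. Structures are made up. */
#define NONROT_BIT 0x1UL

struct mddev_model { int nonrot_disks; };

struct rdev_model {
	unsigned long flags;
	int bdev_nonrot;	/* stands in for bdev_nonrot(rdev->bdev) */
	struct mddev_model *mddev;
};

static void rdev_update_nonrot(struct rdev_model *rdev)
{
	if (!rdev->bdev_nonrot)
		return;
	rdev->flags |= NONROT_BIT;
	rdev->mddev->nonrot_disks++;
}

static void rdev_clear_nonrot(struct rdev_model *rdev)
{
	/* only decrement if this rdev was actually counted */
	if (!(rdev->flags & NONROT_BIT))
		return;
	rdev->flags &= ~NONROT_BIT;
	rdev->mddev->nonrot_disks--;
}
```

The guard in rdev_clear_nonrot() is what keeps the counter balanced regardless of which add/remove path ends up hosting the calls.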
>
> Regards
> Xiao
>
> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>>
>> From: Yu Kuai <[email protected]>
>>
>> For raid1/raid10, each read will iterate all the rdevs from conf and
>> check if any rdev is non-rotational, then choose rdev with minimal IO
>> inflight if so, or rdev with closest distance otherwise.
>>
>> Disk nonrot info can be changed through sysfs entry:
>>
>> /sys/block/[disk_name]/queue/rotational
>>
>> However, consider that this should only be used for testing, and user
>> really shouldn't do this in real life. Record the number of non-rotational
>> disks in mddev, to avoid checking each rdev in IO fast path and simplify
>> read_balance() a little bit.
>>
>> Co-developed-by: Paul Luse <[email protected]>
>> Signed-off-by: Paul Luse <[email protected]>
>> Signed-off-by: Yu Kuai <[email protected]>
>> ---
>> drivers/md/md.c | 28 ++++++++++++++++++++++++++--
>> drivers/md/md.h | 2 ++
>> drivers/md/raid1.c | 9 ++-------
>> drivers/md/raid10.c | 8 ++------
>> 4 files changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index e2a5f513dbb7..9e671eec9309 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -146,6 +146,24 @@ static inline int speed_max(struct mddev *mddev)
>> mddev->sync_speed_max : sysctl_speed_limit_max;
>> }
>>
>> +static void rdev_update_nonrot(struct md_rdev *rdev)
>> +{
>> + if (!bdev_nonrot(rdev->bdev))
>> + return;
>> +
>> + set_bit(Nonrot, &rdev->flags);
>> + WRITE_ONCE(rdev->mddev->nonrot_disks, rdev->mddev->nonrot_disks + 1);
>> +}
>> +
>> +static void rdev_clear_nonrot(struct md_rdev *rdev)
>> +{
>> + if (!test_bit(Nonrot, &rdev->flags))
>> + return;
>> +
>> + clear_bit(Nonrot, &rdev->flags);
>> + WRITE_ONCE(rdev->mddev->nonrot_disks, rdev->mddev->nonrot_disks - 1);
>> +}
>> +
>> static void rdev_uninit_serial(struct md_rdev *rdev)
>> {
>> if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
>> @@ -2922,6 +2940,8 @@ static int add_bound_rdev(struct md_rdev *rdev)
>> md_kick_rdev_from_array(rdev);
>> return err;
>> }
>> +
>> + rdev_update_nonrot(rdev);
>> }
>> sysfs_notify_dirent_safe(rdev->sysfs_state);
>>
>> @@ -3271,8 +3291,10 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len)
>> if (err) {
>> rdev->raid_disk = -1;
>> return err;
>> - } else
>> - sysfs_notify_dirent_safe(rdev->sysfs_state);
>> + }
>> +
>> + rdev_update_nonrot(rdev);
>> + sysfs_notify_dirent_safe(rdev->sysfs_state);
>> /* failure here is OK */;
>> sysfs_link_rdev(rdev->mddev, rdev);
>> /* don't wakeup anyone, leave that to userspace. */
>> @@ -9266,6 +9288,7 @@ static int remove_and_add_spares(struct mddev *mddev,
>> rdev_for_each(rdev, mddev) {
>> if ((this == NULL || rdev == this) && rdev_removeable(rdev) &&
>> !mddev->pers->hot_remove_disk(mddev, rdev)) {
>> + rdev_clear_nonrot(rdev);
>> sysfs_unlink_rdev(mddev, rdev);
>> rdev->saved_raid_disk = rdev->raid_disk;
>> rdev->raid_disk = -1;
>> @@ -9289,6 +9312,7 @@ static int remove_and_add_spares(struct mddev *mddev,
>> if (!test_bit(Journal, &rdev->flags))
>> rdev->recovery_offset = 0;
>> if (mddev->pers->hot_add_disk(mddev, rdev) == 0) {
>> + rdev_update_nonrot(rdev);
>> /* failure here is OK */
>> sysfs_link_rdev(mddev, rdev);
>> if (!test_bit(Journal, &rdev->flags))
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index a49ab04ab707..54aa951f2bba 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -207,6 +207,7 @@ enum flag_bits {
>> * check if there is collision between raid1
>> * serial bios.
>> */
>> + Nonrot, /* non-rotational device (SSD) */
>> };
>>
>> static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
>> @@ -312,6 +313,7 @@ struct mddev {
>> unsigned long flags;
>> unsigned long sb_flags;
>>
>> + int nonrot_disks;
>> int suspended;
>> struct mutex suspend_mutex;
>> struct percpu_ref active_io;
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index a145fe48b9ce..c60ea58ae8c5 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -599,7 +599,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> int sectors;
>> int best_good_sectors;
>> int best_disk, best_dist_disk, best_pending_disk;
>> - int has_nonrot_disk;
>> int disk;
>> sector_t best_dist;
>> unsigned int min_pending;
>> @@ -620,7 +619,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> best_pending_disk = -1;
>> min_pending = UINT_MAX;
>> best_good_sectors = 0;
>> - has_nonrot_disk = 0;
>> choose_next_idle = 0;
>> clear_bit(R1BIO_FailFast, &r1_bio->state);
>>
>> @@ -637,7 +635,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> sector_t first_bad;
>> int bad_sectors;
>> unsigned int pending;
>> - bool nonrot;
>>
>> rdev = conf->mirrors[disk].rdev;
>> if (r1_bio->bios[disk] == IO_BLOCKED
>> @@ -703,8 +700,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> /* At least two disks to choose from so failfast is OK */
>> set_bit(R1BIO_FailFast, &r1_bio->state);
>>
>> - nonrot = bdev_nonrot(rdev->bdev);
>> - has_nonrot_disk |= nonrot;
>> pending = atomic_read(&rdev->nr_pending);
>> dist = abs(this_sector - conf->mirrors[disk].head_position);
>> if (choose_first) {
>> @@ -731,7 +726,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> * small, but not a big deal since when the second disk
>> * starts IO, the first disk is likely still busy.
>> */
>> - if (nonrot && opt_iosize > 0 &&
>> + if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
>> mirror->seq_start != MaxSector &&
>> mirror->next_seq_sect > opt_iosize &&
>> mirror->next_seq_sect - opt_iosize >=
>> @@ -763,7 +758,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> * mixed ratation/non-rotational disks depending on workload.
>> */
>> if (best_disk == -1) {
>> - if (has_nonrot_disk || min_pending == 0)
>> + if (conf->mddev->nonrot_disks || min_pending == 0)
>> best_disk = best_pending_disk;
>> else
>> best_disk = best_dist_disk;
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index d5a7a621f0f0..1f6693e40e12 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -735,7 +735,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>> struct md_rdev *best_dist_rdev, *best_pending_rdev, *rdev = NULL;
>> int do_balance;
>> int best_dist_slot, best_pending_slot;
>> - bool has_nonrot_disk = false;
>> unsigned int min_pending;
>> struct geom *geo = &conf->geo;
>>
>> @@ -766,7 +765,6 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>> int bad_sectors;
>> sector_t dev_sector;
>> unsigned int pending;
>> - bool nonrot;
>>
>> if (r10_bio->devs[slot].bio == IO_BLOCKED)
>> continue;
>> @@ -818,10 +816,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>> if (!do_balance)
>> break;
>>
>> - nonrot = bdev_nonrot(rdev->bdev);
>> - has_nonrot_disk |= nonrot;
>> pending = atomic_read(&rdev->nr_pending);
>> - if (min_pending > pending && nonrot) {
>> + if (min_pending > pending && test_bit(Nonrot, &rdev->flags)) {
>> min_pending = pending;
>> best_pending_slot = slot;
>> best_pending_rdev = rdev;
>> @@ -851,7 +847,7 @@ static struct md_rdev *read_balance(struct r10conf *conf,
>> }
>> }
>> if (slot >= conf->copies) {
>> - if (has_nonrot_disk) {
>> + if (conf->mddev->nonrot_disks) {
>> slot = best_pending_slot;
>> rdev = best_pending_rdev;
>> } else {
>> --
>> 2.39.2
>>
>>
>
> .
>
On Mon, Feb 26, 2024 at 9:25 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2024/02/26 21:12, Xiao Ni wrote:
> > Hi Kuai
> >
> > I added some logs to check and add_bound_rdev can't be called when
> > creating raid device. Maybe move rdev_update_nonrot to
> > bind_rdev_to_array?
>
> bind_rdev_to_array() is used to add new rdev to the array, then
> 'pers->hot_add_disk' is used to add the rdev to conf. For new spares,
> bind_rdev_to_array() will be called while 'pers->hot_add_disk' won't.
> Hence rdev_update_nonrot() is used where 'pers->hot_add_disk' succeed
> in this patch.
>
> Perhaps it's better to move related code to raid1/raid10_add_disk() and
> the new counter in raid1/raid10 conf?
Yes, this sounds better.
Regards
Xiao
>
> Thanks,
> Kuai
>
> >
> > Regards
> > Xiao
> >
> > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
> > [patch quoted in full above; snipped]
>
On Thu, Feb 22, 2024 at 4:05 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> If resync is in progress, read_balance() should find the first usable
> disk; otherwise, data could be inconsistent after resync is done. raid1
> and raid10 implement the same check, hence factor it out to make the
> code cleaner.
>
> Note that raid1 uses 'mddev->recovery_cp', which is updated after all
> resync IO is done, while raid10 uses 'conf->next_resync', which is
> inaccurate because raid10 updates it before submitting resync IO.
> Fortunately, raid10 read IO can't run concurrently with resync IO, so
> there is no problem. This patch also switches raid10 to use
> 'mddev->recovery_cp'.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1-10.c | 20 ++++++++++++++++++++
> drivers/md/raid1.c | 15 ++-------------
> drivers/md/raid10.c | 13 ++-----------
> 3 files changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
> index 9bc0f0022a6c..2ea1710a3b70 100644
> --- a/drivers/md/raid1-10.c
> +++ b/drivers/md/raid1-10.c
> @@ -276,3 +276,23 @@ static inline int raid1_check_read_range(struct md_rdev *rdev,
> *len = first_bad + bad_sectors - this_sector;
> return 0;
> }
> +
> +/*
> + * Check if read should choose the first rdev.
> + *
> + * Balance on the whole device if no resync is going on (recovery is ok) or
> + * below the resync window. Otherwise, take the first readable disk.
> + */
> +static inline bool raid1_should_read_first(struct mddev *mddev,
> + sector_t this_sector, int len)
> +{
> + if ((mddev->recovery_cp < this_sector + len))
> + return true;
> +
> + if (mddev_is_clustered(mddev) &&
> + md_cluster_ops->area_resyncing(mddev, READ, this_sector,
> + this_sector + len))
> + return true;
> +
> + return false;
> +}
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d0bc67e6d068..8089c569e84f 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -605,11 +605,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> struct md_rdev *rdev;
> int choose_first;
>
> - /*
> - * Check if we can balance. We can balance on the whole
> - * device if no resync is going on, or below the resync window.
> - * We take the first readable disk when above the resync window.
> - */
> retry:
> sectors = r1_bio->sectors;
> best_disk = -1;
> @@ -618,16 +613,10 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_pending_disk = -1;
> min_pending = UINT_MAX;
> best_good_sectors = 0;
> + choose_first = raid1_should_read_first(conf->mddev, this_sector,
> + sectors);
> clear_bit(R1BIO_FailFast, &r1_bio->state);
>
> - if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> - (mddev_is_clustered(conf->mddev) &&
> - md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector,
> - this_sector + sectors)))
> - choose_first = 1;
> - else
> - choose_first = 0;
> -
> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> sector_t dist;
> sector_t first_bad;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 1f6693e40e12..22bcc3ce11d3 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -747,17 +747,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
> best_good_sectors = 0;
> do_balance = 1;
> clear_bit(R10BIO_FailFast, &r10_bio->state);
> - /*
> - * Check if we can balance. We can balance on the whole
> - * device if no resync is going on (recovery is ok), or below
> - * the resync window. We take the first readable disk when
> - * above the resync window.
> - */
> - if ((conf->mddev->recovery_cp < MaxSector
> - && (this_sector + sectors >= conf->next_resync)) ||
> - (mddev_is_clustered(conf->mddev) &&
> - md_cluster_ops->area_resyncing(conf->mddev, READ, this_sector,
> - this_sector + sectors)))
> +
> + if (raid1_should_read_first(conf->mddev, this_sector, sectors))
> do_balance = 0;
>
> for (slot = 0; slot < conf->copies ; slot++) {
> --
> 2.39.2
>
>
This patch looks good to me.
Reviewed-by: Xiao Ni <[email protected]>
On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> read_balance() is hard to understand because there are too many states
> and branches, and it's overlong.
>
> This patch factors out the case of reading the first rdev from
> read_balance(); there are no functional changes.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 63 +++++++++++++++++++++++++++++++++-------------
> 1 file changed, 46 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 8089c569e84f..08c45ca55a7e 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -579,6 +579,47 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector,
> return len;
> }
>
> +static void update_read_sectors(struct r1conf *conf, int disk,
> + sector_t this_sector, int len)
> +{
> + struct raid1_info *info = &conf->mirrors[disk];
> +
> + atomic_inc(&info->rdev->nr_pending);
> + if (info->next_seq_sect != this_sector)
> + info->seq_start = this_sector;
> + info->next_seq_sect = this_sector + len;
> +}
> +
> +static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> + int *max_sectors)
> +{
> + sector_t this_sector = r1_bio->sector;
> + int len = r1_bio->sectors;
> + int disk;
> +
> + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> + struct md_rdev *rdev;
> + int read_len;
> +
> + if (r1_bio->bios[disk] == IO_BLOCKED)
> + continue;
> +
> + rdev = conf->mirrors[disk].rdev;
> + if (!rdev || test_bit(Faulty, &rdev->flags))
> + continue;
> +
> + /* choose the first disk even if it has some bad blocks. */
> + read_len = raid1_check_read_range(rdev, this_sector, &len);
> + if (read_len > 0) {
> + update_read_sectors(conf, disk, this_sector, read_len);
> + *max_sectors = read_len;
> + return disk;
> + }
Hi Kuai
It needs to update max_sectors even if the bad block starts before
this_sector, because no more than the bad-block length can be read
from other member disks; reading more data than that would cause data
corruption. One rule here is: read from the primary disk (the first
readable disk) if it has no bad blocks, and read only the
bad-block-length of data from other disks.
Best Regards
Xiao
> + }
> +
> + return -1;
> +}
> +
> /*
> * This routine returns the disk from which the requested read should
> * be done. There is a per-array 'next expected sequential IO' sector
> @@ -603,7 +644,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> sector_t best_dist;
> unsigned int min_pending;
> struct md_rdev *rdev;
> - int choose_first;
>
> retry:
> sectors = r1_bio->sectors;
> @@ -613,10 +653,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_pending_disk = -1;
> min_pending = UINT_MAX;
> best_good_sectors = 0;
> - choose_first = raid1_should_read_first(conf->mddev, this_sector,
> - sectors);
> clear_bit(R1BIO_FailFast, &r1_bio->state);
>
> + if (raid1_should_read_first(conf->mddev, this_sector, sectors))
> + return choose_first_rdev(conf, r1_bio, max_sectors);
> +
> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> sector_t dist;
> sector_t first_bad;
> @@ -662,8 +703,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * bad_sectors from another device..
> */
> bad_sectors -= (this_sector - first_bad);
> - if (choose_first && sectors > bad_sectors)
> - sectors = bad_sectors;
> if (best_good_sectors > sectors)
> best_good_sectors = sectors;
>
> @@ -673,8 +712,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_good_sectors = good_sectors;
> best_disk = disk;
> }
> - if (choose_first)
> - break;
> }
> continue;
> } else {
> @@ -689,10 +726,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>
> pending = atomic_read(&rdev->nr_pending);
> dist = abs(this_sector - conf->mirrors[disk].head_position);
> - if (choose_first) {
> - best_disk = disk;
> - break;
> - }
> /* Don't change to another disk for sequential reads */
> if (conf->mirrors[disk].next_seq_sect == this_sector
> || dist == 0) {
> @@ -760,13 +793,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> rdev = conf->mirrors[best_disk].rdev;
> if (!rdev)
> goto retry;
> - atomic_inc(&rdev->nr_pending);
> - sectors = best_good_sectors;
> -
> - if (conf->mirrors[best_disk].next_seq_sect != this_sector)
> - conf->mirrors[best_disk].seq_start = this_sector;
>
> - conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
> + sectors = best_good_sectors;
> + update_read_sectors(conf, disk, this_sector, sectors);
> }
> *max_sectors = sectors;
>
> --
> 2.39.2
>
>
Hi Kuai and Paul,
On Thu, Feb 22, 2024 at 12:03 AM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> The original idea is that Paul wants to optimize raid1 read
> performance([1]); however, we think that the original code for
> read_balance() is quite complex, and we don't want to add more
> complexity. Hence we decided to refactor read_balance() first, to make
> the code cleaner and easier to follow up on.
>
> Before this patchset, read_balance() had many local variables and many
> branches; it tried to consider all the scenarios in one iteration. The
> idea of this patchset is to divide them into 4 different steps:
>
> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks, skipping slow disks and disks with bad
> blocks, and choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with the most
> sectors, patch 8. If no disk is found:
> 4) Choose the first found slow disk with no bad blocks, or the slow
> disk with the most sectors, patch 7.
Thanks for your great work in this set. It looks great.
Please address feedback from folks and send v2. We can still get this in
6.9 merge window.
Thanks,
Song
Hi,
On 2024/02/26 22:16, Xiao Ni wrote:
> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>>
>> From: Yu Kuai <[email protected]>
>>
>> read_balance() is hard to understand because there are too many status
>> and branches, and it's overlong.
>>
>> This patch factor out the case to read the first rdev from
>> read_balance(), there are no functional changes.
>>
>> Co-developed-by: Paul Luse <[email protected]>
>> Signed-off-by: Paul Luse <[email protected]>
>> Signed-off-by: Yu Kuai <[email protected]>
>> [...]
>> + /* choose the first disk even if it has some bad blocks. */
>> + read_len = raid1_check_read_range(rdev, this_sector, &len);
>> + if (read_len > 0) {
>> + update_read_sectors(conf, disk, this_sector, read_len);
>> + *max_sectors = read_len;
>> + return disk;
>> + }
>
> Hi Kuai
>
> It needs to update max_sectors even if the bad block starts before
> this_sector. Because it can't read more than bad_blocks from other
> member disks. If it reads more data than bad blocks, it will cause
> data corruption. One rule here is read from the primary disk (the
> first readable disk) if it has no bad block and read the
> badblock-data-length data from other disks.
Note that raid1_check_read_range() returns the readable length from
this rdev. Hence, if a bad block starts before this_sector, 0 is
returned and 'len' is updated to the length of the bad blocks (if it
doesn't exceed the read range); the following iteration will then find
the first disk able to read the updated 'len' of data, and update
max_sectors.
Thanks,
Kuai
>
> Best Regards
> Xiao
>
>> + }
>> +
>> + return -1;
>> +}
>> +
>> /*
>> * This routine returns the disk from which the requested read should
>> * be done. There is a per-array 'next expected sequential IO' sector
>> @@ -603,7 +644,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> sector_t best_dist;
>> unsigned int min_pending;
>> struct md_rdev *rdev;
>> - int choose_first;
>>
>> retry:
>> sectors = r1_bio->sectors;
>> @@ -613,10 +653,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> best_pending_disk = -1;
>> min_pending = UINT_MAX;
>> best_good_sectors = 0;
>> - choose_first = raid1_should_read_first(conf->mddev, this_sector,
>> - sectors);
>> clear_bit(R1BIO_FailFast, &r1_bio->state);
>>
>> + if (raid1_should_read_first(conf->mddev, this_sector, sectors))
>> + return choose_first_rdev(conf, r1_bio, max_sectors);
>> +
>> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
>> sector_t dist;
>> sector_t first_bad;
>> @@ -662,8 +703,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> * bad_sectors from another device..
>> */
>> bad_sectors -= (this_sector - first_bad);
>> - if (choose_first && sectors > bad_sectors)
>> - sectors = bad_sectors;
>> if (best_good_sectors > sectors)
>> best_good_sectors = sectors;
>>
>> @@ -673,8 +712,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> best_good_sectors = good_sectors;
>> best_disk = disk;
>> }
>> - if (choose_first)
>> - break;
>> }
>> continue;
>> } else {
>> @@ -689,10 +726,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>
>> pending = atomic_read(&rdev->nr_pending);
>> dist = abs(this_sector - conf->mirrors[disk].head_position);
>> - if (choose_first) {
>> - best_disk = disk;
>> - break;
>> - }
>> /* Don't change to another disk for sequential reads */
>> if (conf->mirrors[disk].next_seq_sect == this_sector
>> || dist == 0) {
>> @@ -760,13 +793,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> rdev = conf->mirrors[best_disk].rdev;
>> if (!rdev)
>> goto retry;
>> - atomic_inc(&rdev->nr_pending);
>> - sectors = best_good_sectors;
>> -
>> - if (conf->mirrors[best_disk].next_seq_sect != this_sector)
>> - conf->mirrors[best_disk].seq_start = this_sector;
>>
>> - conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>> + sectors = best_good_sectors;
>> + update_read_sectors(conf, disk, this_sector, sectors);
>> }
>> *max_sectors = sectors;
>>
>> --
>> 2.39.2
>>
>>
>
> .
>
On Tue, Feb 27, 2024 at 9:06 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2024/02/26 22:16, Xiao Ni wrote:
> > [Xiao's comment and the quoted patch, shown in full above; snipped]
>
> Noted that raid1_check_read_range() will return readable length from
> this rdev, hence if bad block starts before this_sector, 0 is returned,
> and 'len' is updated to the length of badblocks(if not exceed read
> range), and following iteration will find the first disk to read updated
> 'len' data and update max_sectors.
Hi Kuai
The problem is that choose_first_rdev() doesn't return 'len' through
max_sectors when bad blocks start before this_sector. In the following
iteration, it must not read more than 'len' from other disks, to avoid
data corruption. I haven't read all the patches yet; as for this patch, it
resets best_good_sectors to sectors when it encounters a good member
disk without bad blocks.
Regards
Xiao
>
> Thanks,
> Kuai
>
> >
> > Best Regards
> > Xiao
> >
> >> + }
> >> +
> >> + return -1;
> >> +}
> >> +
> >> /*
> >> * This routine returns the disk from which the requested read should
> >> * be done. There is a per-array 'next expected sequential IO' sector
> >> @@ -603,7 +644,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> sector_t best_dist;
> >> unsigned int min_pending;
> >> struct md_rdev *rdev;
> >> - int choose_first;
> >>
> >> retry:
> >> sectors = r1_bio->sectors;
> >> @@ -613,10 +653,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> best_pending_disk = -1;
> >> min_pending = UINT_MAX;
> >> best_good_sectors = 0;
> >> - choose_first = raid1_should_read_first(conf->mddev, this_sector,
> >> - sectors);
> >> clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>
> >> + if (raid1_should_read_first(conf->mddev, this_sector, sectors))
> >> + return choose_first_rdev(conf, r1_bio, max_sectors);
> >> +
> >> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> >> sector_t dist;
> >> sector_t first_bad;
> >> @@ -662,8 +703,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> * bad_sectors from another device..
> >> */
> >> bad_sectors -= (this_sector - first_bad);
> >> - if (choose_first && sectors > bad_sectors)
> >> - sectors = bad_sectors;
> >> if (best_good_sectors > sectors)
> >> best_good_sectors = sectors;
> >>
> >> @@ -673,8 +712,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> best_good_sectors = good_sectors;
> >> best_disk = disk;
> >> }
> >> - if (choose_first)
> >> - break;
> >> }
> >> continue;
> >> } else {
> >> @@ -689,10 +726,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>
> >> pending = atomic_read(&rdev->nr_pending);
> >> dist = abs(this_sector - conf->mirrors[disk].head_position);
> >> - if (choose_first) {
> >> - best_disk = disk;
> >> - break;
> >> - }
> >> /* Don't change to another disk for sequential reads */
> >> if (conf->mirrors[disk].next_seq_sect == this_sector
> >> || dist == 0) {
> >> @@ -760,13 +793,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> rdev = conf->mirrors[best_disk].rdev;
> >> if (!rdev)
> >> goto retry;
> >> - atomic_inc(&rdev->nr_pending);
> >> - sectors = best_good_sectors;
> >> -
> >> - if (conf->mirrors[best_disk].next_seq_sect != this_sector)
> >> - conf->mirrors[best_disk].seq_start = this_sector;
> >>
> >> - conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
> >> + sectors = best_good_sectors;
> >> + update_read_sectors(conf, best_disk, this_sector, sectors);
> >> }
> >> *max_sectors = sectors;
> >>
> >> --
> >> 2.39.2
> >>
> >>
> >
> > .
> >
>
Hi,
在 2024/02/27 9:23, Xiao Ni 写道:
> On Tue, Feb 27, 2024 at 9:06 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> 在 2024/02/26 22:16, Xiao Ni 写道:
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>>>>
>>>> From: Yu Kuai <[email protected]>
>>>>
>>>> read_balance() is hard to understand because there are too many status
>>>> and branches, and it's overlong.
>>>>
>>>> This patch factor out the case to read the first rdev from
>>>> read_balance(), there are no functional changes.
>>>>
>>>> Co-developed-by: Paul Luse <[email protected]>
>>>> Signed-off-by: Paul Luse <[email protected]>
>>>> Signed-off-by: Yu Kuai <[email protected]>
>>>> ---
>>>> drivers/md/raid1.c | 63 +++++++++++++++++++++++++++++++++-------------
>>>> 1 file changed, 46 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index 8089c569e84f..08c45ca55a7e 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -579,6 +579,47 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector,
>>>> return len;
>>>> }
>>>>
>>>> +static void update_read_sectors(struct r1conf *conf, int disk,
>>>> + sector_t this_sector, int len)
>>>> +{
>>>> + struct raid1_info *info = &conf->mirrors[disk];
>>>> +
>>>> + atomic_inc(&info->rdev->nr_pending);
>>>> + if (info->next_seq_sect != this_sector)
>>>> + info->seq_start = this_sector;
>>>> + info->next_seq_sect = this_sector + len;
>>>> +}
>>>> +
>>>> +static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>>> + int *max_sectors)
>>>> +{
>>>> + sector_t this_sector = r1_bio->sector;
>>>> + int len = r1_bio->sectors;
>>>> + int disk;
>>>> +
>>>> + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
>>>> + struct md_rdev *rdev;
>>>> + int read_len;
>>>> +
>>>> + if (r1_bio->bios[disk] == IO_BLOCKED)
>>>> + continue;
>>>> +
>>>> + rdev = conf->mirrors[disk].rdev;
>>>> + if (!rdev || test_bit(Faulty, &rdev->flags))
>>>> + continue;
>>>> +
>>>> + /* choose the first disk even if it has some bad blocks. */
>>>> + read_len = raid1_check_read_range(rdev, this_sector, &len);
>>>> + if (read_len > 0) {
>>>> + update_read_sectors(conf, disk, this_sector, read_len);
>>>> + *max_sectors = read_len;
>>>> + return disk;
>>>> + }
>>>
>>> Hi Kuai
>>>
>>> It needs to update max_sectors even if the bad block starts before
>>> this_sector. Because it can't read more than bad_blocks from other
>>> member disks. If it reads more data than bad blocks, it will cause
>>> data corruption. One rule here is read from the primary disk (the
>>> first readable disk) if it has no bad block and read the
>>> badblock-data-length data from other disks.
>>
>> Noted that raid1_check_read_range() will return readable length from
>> this rdev, hence if bad block starts before this_sector, 0 is returned,
>> and 'len' is updated to the length of badblocks(if not exceed read
>> range), and following iteration will find the first disk to read updated
>> 'len' data and update max_sectors.
>
> Hi Kuai
>
> The problem is that choose_first_rdev() doesn't return 'len' through
> max_sectors when bad blocks start before this_sector. In the following
> iteration, it must not read more than 'len' from other disks, to avoid
> data corruption. I haven't read all the patches yet; as for this patch, it
> resets best_good_sectors to sectors when it encounters a good member
> disk without bad blocks.
In this case, 'len' is not supposed to be returned; the caller will split
the original IO based on 'max_sectors'. For example:

IO: 2, 4          |  ----
rdev1: BB: 0, 4   |xxxx
rdev2: no BB

Then choose_first_rdev() will set max_sectors to 2 and return rdev2, and
the caller will split and issue a new IO:

original IO: 4, 2 |    --
split IO:    2, 2 |  --

Finally, the split IO is issued to rdev2. Later the original IO will be
handled by read_balance() again, and rdev1 will be returned.
Is this the case you are concerned about?
Thanks,
Kuai
>
> Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>
>>>
>>> Best Regards
>>> Xiao
>>>
>>>> + }
>>>> +
>>>> + return -1;
>>>> +}
>>>> +
>>>> /*
>>>> * This routine returns the disk from which the requested read should
>>>> * be done. There is a per-array 'next expected sequential IO' sector
>>>> @@ -603,7 +644,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> sector_t best_dist;
>>>> unsigned int min_pending;
>>>> struct md_rdev *rdev;
>>>> - int choose_first;
>>>>
>>>> retry:
>>>> sectors = r1_bio->sectors;
>>>> @@ -613,10 +653,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> best_pending_disk = -1;
>>>> min_pending = UINT_MAX;
>>>> best_good_sectors = 0;
>>>> - choose_first = raid1_should_read_first(conf->mddev, this_sector,
>>>> - sectors);
>>>> clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>>
>>>> + if (raid1_should_read_first(conf->mddev, this_sector, sectors))
>>>> + return choose_first_rdev(conf, r1_bio, max_sectors);
>>>> +
>>>> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
>>>> sector_t dist;
>>>> sector_t first_bad;
>>>> @@ -662,8 +703,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> * bad_sectors from another device..
>>>> */
>>>> bad_sectors -= (this_sector - first_bad);
>>>> - if (choose_first && sectors > bad_sectors)
>>>> - sectors = bad_sectors;
>>>> if (best_good_sectors > sectors)
>>>> best_good_sectors = sectors;
>>>>
>>>> @@ -673,8 +712,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> best_good_sectors = good_sectors;
>>>> best_disk = disk;
>>>> }
>>>> - if (choose_first)
>>>> - break;
>>>> }
>>>> continue;
>>>> } else {
>>>> @@ -689,10 +726,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>
>>>> pending = atomic_read(&rdev->nr_pending);
>>>> dist = abs(this_sector - conf->mirrors[disk].head_position);
>>>> - if (choose_first) {
>>>> - best_disk = disk;
>>>> - break;
>>>> - }
>>>> /* Don't change to another disk for sequential reads */
>>>> if (conf->mirrors[disk].next_seq_sect == this_sector
>>>> || dist == 0) {
>>>> @@ -760,13 +793,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> rdev = conf->mirrors[best_disk].rdev;
>>>> if (!rdev)
>>>> goto retry;
>>>> - atomic_inc(&rdev->nr_pending);
>>>> - sectors = best_good_sectors;
>>>> -
>>>> - if (conf->mirrors[best_disk].next_seq_sect != this_sector)
>>>> - conf->mirrors[best_disk].seq_start = this_sector;
>>>>
>>>> - conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>>>> + sectors = best_good_sectors;
>>>> + update_read_sectors(conf, best_disk, this_sector, sectors);
>>>> }
>>>> *max_sectors = sectors;
>>>>
>>>> --
>>>> 2.39.2
>>>>
>>>>
>>>
>>> .
>>>
>>
>
> .
>
On Tue, Feb 27, 2024 at 9:23 AM Xiao Ni <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 9:06 AM Yu Kuai <[email protected]> wrote:
> >
> > Hi,
> >
> > 在 2024/02/26 22:16, Xiao Ni 写道:
> > > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
> > >>
> > >> From: Yu Kuai <[email protected]>
> > >>
> > >> read_balance() is hard to understand because there are too many status
> > >> and branches, and it's overlong.
> > >>
> > >> This patch factor out the case to read the first rdev from
> > >> read_balance(), there are no functional changes.
> > >>
> > >> Co-developed-by: Paul Luse <[email protected]>
> > >> Signed-off-by: Paul Luse <[email protected]>
> > >> Signed-off-by: Yu Kuai <[email protected]>
> > >> ---
> > >> drivers/md/raid1.c | 63 +++++++++++++++++++++++++++++++++-------------
> > >> 1 file changed, 46 insertions(+), 17 deletions(-)
> > >>
> > >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > >> index 8089c569e84f..08c45ca55a7e 100644
> > >> --- a/drivers/md/raid1.c
> > >> +++ b/drivers/md/raid1.c
> > >> @@ -579,6 +579,47 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector,
> > >> return len;
> > >> }
> > >>
> > >> +static void update_read_sectors(struct r1conf *conf, int disk,
> > >> + sector_t this_sector, int len)
> > >> +{
> > >> + struct raid1_info *info = &conf->mirrors[disk];
> > >> +
> > >> + atomic_inc(&info->rdev->nr_pending);
> > >> + if (info->next_seq_sect != this_sector)
> > >> + info->seq_start = this_sector;
> > >> + info->next_seq_sect = this_sector + len;
> > >> +}
> > >> +
> > >> +static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> > >> + int *max_sectors)
> > >> +{
> > >> + sector_t this_sector = r1_bio->sector;
> > >> + int len = r1_bio->sectors;
> > >> + int disk;
> > >> +
> > >> + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> > >> + struct md_rdev *rdev;
> > >> + int read_len;
> > >> +
> > >> + if (r1_bio->bios[disk] == IO_BLOCKED)
> > >> + continue;
> > >> +
> > >> + rdev = conf->mirrors[disk].rdev;
> > >> + if (!rdev || test_bit(Faulty, &rdev->flags))
> > >> + continue;
> > >> +
> > >> + /* choose the first disk even if it has some bad blocks. */
> > >> + read_len = raid1_check_read_range(rdev, this_sector, &len);
> > >> + if (read_len > 0) {
> > >> + update_read_sectors(conf, disk, this_sector, read_len);
> > >> + *max_sectors = read_len;
> > >> + return disk;
> > >> + }
> > >
> > > Hi Kuai
> > >
> > > It needs to update max_sectors even if the bad block starts before
> > > this_sector. Because it can't read more than bad_blocks from other
> > > member disks. If it reads more data than bad blocks, it will cause
> > > data corruption. One rule here is read from the primary disk (the
> > > first readable disk) if it has no bad block and read the
> > > badblock-data-length data from other disks.
> >
> > Noted that raid1_check_read_range() will return readable length from
> > this rdev, hence if bad block starts before this_sector, 0 is returned,
> > and 'len' is updated to the length of badblocks(if not exceed read
> > range), and following iteration will find the first disk to read updated
> > 'len' data and update max_sectors.
>
> Hi Kuai
>
> The problem is that choose_first_rdev() doesn't return 'len' through
> max_sectors when bad blocks start before this_sector. In the following
> iteration, it must not read more than 'len' from other disks, to avoid
> data corruption. I haven't read all the patches yet; as for this patch, it
> resets best_good_sectors to sectors when it encounters a good member
> disk without bad blocks.
Sorry, I missed one thing. Besides the point I mentioned above, it needs
to finally update 'sectors' (the read region) to 'len'. It looks like
raid1_check_read_range() needs to be modified to achieve this.
Regards
Xiao
>
> Regards
> Xiao
> >
> > Thanks,
> > Kuai
> >
> > >
> > > Best Regards
> > > Xiao
> > >
> > >> + }
> > >> +
> > >> + return -1;
> > >> +}
> > >> +
> > >> /*
> > >> * This routine returns the disk from which the requested read should
> > >> * be done. There is a per-array 'next expected sequential IO' sector
> > >> @@ -603,7 +644,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> > >> sector_t best_dist;
> > >> unsigned int min_pending;
> > >> struct md_rdev *rdev;
> > >> - int choose_first;
> > >>
> > >> retry:
> > >> sectors = r1_bio->sectors;
> > >> @@ -613,10 +653,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> > >> best_pending_disk = -1;
> > >> min_pending = UINT_MAX;
> > >> best_good_sectors = 0;
> > >> - choose_first = raid1_should_read_first(conf->mddev, this_sector,
> > >> - sectors);
> > >> clear_bit(R1BIO_FailFast, &r1_bio->state);
> > >>
> > >> + if (raid1_should_read_first(conf->mddev, this_sector, sectors))
> > >> + return choose_first_rdev(conf, r1_bio, max_sectors);
> > >> +
> > >> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> > >> sector_t dist;
> > >> sector_t first_bad;
> > >> @@ -662,8 +703,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> > >> * bad_sectors from another device..
> > >> */
> > >> bad_sectors -= (this_sector - first_bad);
> > >> - if (choose_first && sectors > bad_sectors)
> > >> - sectors = bad_sectors;
> > >> if (best_good_sectors > sectors)
> > >> best_good_sectors = sectors;
> > >>
> > >> @@ -673,8 +712,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> > >> best_good_sectors = good_sectors;
> > >> best_disk = disk;
> > >> }
> > >> - if (choose_first)
> > >> - break;
> > >> }
> > >> continue;
> > >> } else {
> > >> @@ -689,10 +726,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> > >>
> > >> pending = atomic_read(&rdev->nr_pending);
> > >> dist = abs(this_sector - conf->mirrors[disk].head_position);
> > >> - if (choose_first) {
> > >> - best_disk = disk;
> > >> - break;
> > >> - }
> > >> /* Don't change to another disk for sequential reads */
> > >> if (conf->mirrors[disk].next_seq_sect == this_sector
> > >> || dist == 0) {
> > >> @@ -760,13 +793,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> > >> rdev = conf->mirrors[best_disk].rdev;
> > >> if (!rdev)
> > >> goto retry;
> > >> - atomic_inc(&rdev->nr_pending);
> > >> - sectors = best_good_sectors;
> > >> -
> > >> - if (conf->mirrors[best_disk].next_seq_sect != this_sector)
> > >> - conf->mirrors[best_disk].seq_start = this_sector;
> > >>
> > >> - conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
> > >> + sectors = best_good_sectors;
> > >> + update_read_sectors(conf, best_disk, this_sector, sectors);
> > >> }
> > >> *max_sectors = sectors;
> > >>
> > >> --
> > >> 2.39.2
> > >>
> > >>
> > >
> > > .
> > >
> >
On Thu, Feb 22, 2024 at 4:06 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> read_balance() is hard to understand because there are too many status
> and branches, and it's overlong.
>
> This patch factor out the case to read the rdev with bad blocks from
> read_balance(), there are no functional changes.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 79 ++++++++++++++++++++++++++++------------------
> 1 file changed, 48 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index bc2f8fcbe5b3..4694e0e71e36 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -620,6 +620,44 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> return -1;
> }
>
> +static int choose_bb_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> + int *max_sectors)
> +{
> + sector_t this_sector = r1_bio->sector;
> + int best_disk = -1;
> + int best_len = 0;
> + int disk;
> +
> + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> + struct md_rdev *rdev;
> + int len;
> + int read_len;
> +
> + if (r1_bio->bios[disk] == IO_BLOCKED)
> + continue;
> +
> + rdev = conf->mirrors[disk].rdev;
> + if (!rdev || test_bit(Faulty, &rdev->flags) ||
> + test_bit(WriteMostly, &rdev->flags))
> + continue;
> +
> + /* keep track of the disk with the most readable sectors. */
> + len = r1_bio->sectors;
> + read_len = raid1_check_read_range(rdev, this_sector, &len);
> + if (read_len > best_len) {
> + best_disk = disk;
> + best_len = read_len;
> + }
> + }
> +
> + if (best_disk != -1) {
> + *max_sectors = best_len;
> + update_read_sectors(conf, best_disk, this_sector, best_len);
> + }
> +
> + return best_disk;
> +}
> +
> static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> int *max_sectors)
> {
> @@ -707,8 +745,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>
> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> sector_t dist;
> - sector_t first_bad;
> - int bad_sectors;
> unsigned int pending;
>
> rdev = conf->mirrors[disk].rdev;
> @@ -721,36 +757,8 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> continue;
> if (test_bit(WriteMostly, &rdev->flags))
> continue;
> - /* This is a reasonable device to use. It might
> - * even be best.
> - */
> - if (is_badblock(rdev, this_sector, sectors,
> - &first_bad, &bad_sectors)) {
> - if (best_dist < MaxSector)
> - /* already have a better device */
> - continue;
> - if (first_bad <= this_sector) {
> - /* cannot read here. If this is the 'primary'
> - * device, then we must not read beyond
> - * bad_sectors from another device..
> - */
> - bad_sectors -= (this_sector - first_bad);
> - if (best_good_sectors > sectors)
> - best_good_sectors = sectors;
> -
> - } else {
> - sector_t good_sectors = first_bad - this_sector;
> - if (good_sectors > best_good_sectors) {
> - best_good_sectors = good_sectors;
> - best_disk = disk;
> - }
> - }
> + if (rdev_has_badblock(rdev, this_sector, sectors))
> continue;
> - } else {
> - if ((sectors > best_good_sectors) && (best_disk >= 0))
> - best_disk = -1;
> - best_good_sectors = sectors;
> - }
>
> if (best_disk >= 0)
> /* At least two disks to choose from so failfast is OK */
> @@ -834,6 +842,15 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> if (best_disk >= 0)
> return best_disk;
>
> + /*
> + * If we are here it means we didn't find a perfectly good disk so
> + * now spend a bit more time trying to find one with the most good
> + * sectors.
> + */
> + disk = choose_bb_rdev(conf, r1_bio, max_sectors);
> + if (disk >= 0)
> + return disk;
> +
> return choose_slow_rdev(conf, r1_bio, max_sectors);
> }
>
> --
> 2.39.2
>
>
Hi
This patch looks good to me.
Reviewed-by: Xiao Ni <[email protected]>
On Thu, Feb 22, 2024 at 4:05 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> There is no functional change for now, make read_balance() cleaner and
> prepare to fix problems and refactor the handler of sequential IO.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 71 +++++++++++++++++++++++++---------------------
> 1 file changed, 38 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 4694e0e71e36..223ef8d06f67 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -705,6 +705,31 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> return bb_disk;
> }
>
> +static bool is_sequential(struct r1conf *conf, int disk, struct r1bio *r1_bio)
> +{
> + /* TODO: address issues with this check and concurrency. */
> + return conf->mirrors[disk].next_seq_sect == r1_bio->sector ||
> + conf->mirrors[disk].head_position == r1_bio->sector;
> +}
> +
> +/*
> + * If buffered sequential IO size exceeds optimal iosize, check if there is idle
> + * disk. If yes, choose the idle disk.
> + */
> +static bool should_choose_next(struct r1conf *conf, int disk)
> +{
> + struct raid1_info *mirror = &conf->mirrors[disk];
> + int opt_iosize;
> +
> + if (!test_bit(Nonrot, &mirror->rdev->flags))
> + return false;
> +
> + opt_iosize = bdev_io_opt(mirror->rdev->bdev) >> 9;
> + return opt_iosize > 0 && mirror->seq_start != MaxSector &&
> + mirror->next_seq_sect > opt_iosize &&
> + mirror->next_seq_sect - opt_iosize >= mirror->seq_start;
> +}
> +
> /*
> * This routine returns the disk from which the requested read should
> * be done. There is a per-array 'next expected sequential IO' sector
> @@ -767,42 +792,22 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> pending = atomic_read(&rdev->nr_pending);
> dist = abs(this_sector - conf->mirrors[disk].head_position);
> /* Don't change to another disk for sequential reads */
> - if (conf->mirrors[disk].next_seq_sect == this_sector
> - || dist == 0) {
> - int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> - struct raid1_info *mirror = &conf->mirrors[disk];
> -
> - /*
> - * If buffered sequential IO size exceeds optimal
> - * iosize, check if there is idle disk. If yes, choose
> - * the idle disk. read_balance could already choose an
> - * idle disk before noticing it's a sequential IO in
> - * this disk. This doesn't matter because this disk
> - * will idle, next time it will be utilized after the
> - * first disk has IO size exceeds optimal iosize. In
> - * this way, iosize of the first disk will be optimal
> - * iosize at least. iosize of the second disk might be
> - * small, but not a big deal since when the second disk
> - * starts IO, the first disk is likely still busy.
> - */
> - if (test_bit(Nonrot, &rdev->flags) && opt_iosize > 0 &&
> - mirror->seq_start != MaxSector &&
> - mirror->next_seq_sect > opt_iosize &&
> - mirror->next_seq_sect - opt_iosize >=
> - mirror->seq_start) {
> - /*
> - * Add 'pending' to avoid choosing this disk if
> - * there is other idle disk.
> - * Set 'dist' to 0, so that if there is no other
> - * idle disk and all disks are rotational, this
> - * disk will still be chosen.
> - */
> - pending++;
> - dist = 0;
> - } else {
> + if (is_sequential(conf, disk, r1_bio)) {
> + if (!should_choose_next(conf, disk)) {
> best_disk = disk;
> break;
> }
> +
> + /*
> + * Add 'pending' to avoid choosing this disk if there is
> + * other idle disk.
> + */
> + pending++;
> + /*
> + * Set 'dist' to 0, so that if there is no other idle
> + * disk, this disk will still be chosen.
> + */
> + dist = 0;
> }
>
> if (min_pending > pending) {
> --
> 2.39.2
>
>
Hi
This patch looks good to me.
Reviewed-by: Xiao Ni <[email protected]>
On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
>
> read_balance:
> for_each_rdev
> if(next_seq_sect == this_sector || disk == 0)
> -> sequential reads
> best_disk = disk;
> if (...)
> choose_next_idle = 1
> continue;
>
> for_each_rdev
> -> iterate next rdev
> if (pending == 0)
> best_disk = disk;
> -> choose the next idle disk
> break;
>
> if (choose_next_idle)
> -> keep using this rdev if there are no other idle disk
> contine
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> remove the code:
>
> - /* If device is idle, use it */
> - if (pending == 0) {
> - best_disk = disk;
> - break;
> - }
>
> Hence choose next idle will never work now, fix this problem by
> following:
>
> 1) don't set best_disk in this case, read_balance() will choose the best
> disk after iterating all the disks;
> 2) add 'pending' so that other idle disk will be chosen;
> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> are rotational, this disk will still be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 21 ++++++++++++---------
> 1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c60ea58ae8c5..d0bc67e6d068 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> unsigned int min_pending;
> struct md_rdev *rdev;
> int choose_first;
> - int choose_next_idle;
>
> /*
> * Check if we can balance. We can balance on the whole
> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_pending_disk = -1;
> min_pending = UINT_MAX;
> best_good_sectors = 0;
> - choose_next_idle = 0;
> clear_bit(R1BIO_FailFast, &r1_bio->state);
>
> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> struct raid1_info *mirror = &conf->mirrors[disk];
>
> - best_disk = disk;
> /*
> * If buffered sequential IO size exceeds optimal
> * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> mirror->next_seq_sect > opt_iosize &&
> mirror->next_seq_sect - opt_iosize >=
> mirror->seq_start) {
> - choose_next_idle = 1;
> - continue;
> + /*
> + * Add 'pending' to avoid choosing this disk if
> + * there is other idle disk.
> + * Set 'dist' to 0, so that if there is no other
> + * idle disk and all disks are rotational, this
> + * disk will still be chosen.
> + */
> + pending++;
> + dist = 0;
> + } else {
> + best_disk = disk;
> + break;
> }
> - break;
> }
Hi Kuai
I noticed something. In patch 12cee5a8a29e, best_disk is set if it's a
sequential read, so if there are no other idle disks, it will read from
the sequential disk. With this patch, it reads from best_pending_disk
even if min_pending is not 0. That looks like wrong behaviour?
Best Regards
Xiao
>
> - if (choose_next_idle)
> - continue;
> -
> if (min_pending > pending) {
> min_pending = pending;
> best_pending_disk = disk;
> --
> 2.39.2
>
>
Hi,
在 2024/02/27 10:23, Xiao Ni 写道:
> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>>
>> From: Yu Kuai <[email protected]>
>>
>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
>> the case choose next idle in read_balance():
>>
>> read_balance:
>> for_each_rdev
>> if(next_seq_sect == this_sector || disk == 0)
>> -> sequential reads
>> best_disk = disk;
>> if (...)
>> choose_next_idle = 1
>> continue;
>>
>> for_each_rdev
>> -> iterate next rdev
>> if (pending == 0)
>> best_disk = disk;
>> -> choose the next idle disk
>> break;
>>
>> if (choose_next_idle)
>> -> keep using this rdev if there are no other idle disk
>> contine
>>
>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>> remove the code:
>>
>> - /* If device is idle, use it */
>> - if (pending == 0) {
>> - best_disk = disk;
>> - break;
>> - }
>>
>> Hence choose next idle will never work now, fix this problem by
>> following:
>>
>> 1) don't set best_disk in this case, read_balance() will choose the best
>> disk after iterating all the disks;
>> 2) add 'pending' so that other idle disk will be chosen;
>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>> are rotational, this disk will still be chosen;
>>
>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>> Co-developed-by: Paul Luse <[email protected]>
>> Signed-off-by: Paul Luse <[email protected]>
>> Signed-off-by: Yu Kuai <[email protected]>
>> ---
>> drivers/md/raid1.c | 21 ++++++++++++---------
>> 1 file changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index c60ea58ae8c5..d0bc67e6d068 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> unsigned int min_pending;
>> struct md_rdev *rdev;
>> int choose_first;
>> - int choose_next_idle;
>>
>> /*
>> * Check if we can balance. We can balance on the whole
>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> best_pending_disk = -1;
>> min_pending = UINT_MAX;
>> best_good_sectors = 0;
>> - choose_next_idle = 0;
>> clear_bit(R1BIO_FailFast, &r1_bio->state);
>>
>> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>> struct raid1_info *mirror = &conf->mirrors[disk];
>>
>> - best_disk = disk;
>> /*
>> * If buffered sequential IO size exceeds optimal
>> * iosize, check if there is idle disk. If yes, choose
>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>> mirror->next_seq_sect > opt_iosize &&
>> mirror->next_seq_sect - opt_iosize >=
>> mirror->seq_start) {
>> - choose_next_idle = 1;
>> - continue;
>> + /*
>> + * Add 'pending' to avoid choosing this disk if
>> + * there is other idle disk.
>> + * Set 'dist' to 0, so that if there is no other
>> + * idle disk and all disks are rotational, this
>> + * disk will still be chosen.
>> + */
>> + pending++;
>> + dist = 0;
>> + } else {
>> + best_disk = disk;
>> + break;
>> }
>> - break;
>> }
>
> Hi Kuai
>
> I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
> a sequential read. If there are no other idle disks, it will read from
> the sequential disk. With this patch, it reads from the
> best_pending_disk even if min_pending is not 0. That looks like wrong
> behaviour?
Yes, nice catch, I didn't notice this yet... So there is a hidden
logic here: sequential IO has higher priority than the minimal-'pending'
selection, and it only yields to 'choose_next_idle' when an idle disk
exists.
Looks like if we want to keep this behaviour, we can add a 'sequential
disk':
if (is_sequential()) {
	if (!should_choose_next())
		return disk;
	ctl.sequential_disk = disk;
}

...

if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
	return ctl.sequential_disk;
Thanks,
Kuai
>
> Best Regards
> Xiao
>>
>> - if (choose_next_idle)
>> - continue;
>> -
>> if (min_pending > pending) {
>> min_pending = pending;
>> best_pending_disk = disk;
>> --
>> 2.39.2
>>
>>
>
> .
>
On Tue, Feb 27, 2024 at 9:44 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2024/02/27 9:23, Xiao Ni 写道:
> > On Tue, Feb 27, 2024 at 9:06 AM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> 在 2024/02/26 22:16, Xiao Ni 写道:
> >>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
> >>>>
> >>>> From: Yu Kuai <[email protected]>
> >>>>
> >>>> read_balance() is hard to understand because there are too many status
> >>>> and branches, and it's overlong.
> >>>>
> >>>> This patch factor out the case to read the first rdev from
> >>>> read_balance(), there are no functional changes.
> >>>>
> >>>> Co-developed-by: Paul Luse <[email protected]>
> >>>> Signed-off-by: Paul Luse <[email protected]>
> >>>> Signed-off-by: Yu Kuai <[email protected]>
> >>>> ---
> >>>> drivers/md/raid1.c | 63 +++++++++++++++++++++++++++++++++-------------
> >>>> 1 file changed, 46 insertions(+), 17 deletions(-)
> >>>>
> >>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >>>> index 8089c569e84f..08c45ca55a7e 100644
> >>>> --- a/drivers/md/raid1.c
> >>>> +++ b/drivers/md/raid1.c
> >>>> @@ -579,6 +579,47 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector,
> >>>> return len;
> >>>> }
> >>>>
> >>>> +static void update_read_sectors(struct r1conf *conf, int disk,
> >>>> + sector_t this_sector, int len)
> >>>> +{
> >>>> + struct raid1_info *info = &conf->mirrors[disk];
> >>>> +
> >>>> + atomic_inc(&info->rdev->nr_pending);
> >>>> + if (info->next_seq_sect != this_sector)
> >>>> + info->seq_start = this_sector;
> >>>> + info->next_seq_sect = this_sector + len;
> >>>> +}
> >>>> +
> >>>> +static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> >>>> + int *max_sectors)
> >>>> +{
> >>>> + sector_t this_sector = r1_bio->sector;
> >>>> + int len = r1_bio->sectors;
> >>>> + int disk;
> >>>> +
> >>>> + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> >>>> + struct md_rdev *rdev;
> >>>> + int read_len;
> >>>> +
> >>>> + if (r1_bio->bios[disk] == IO_BLOCKED)
> >>>> + continue;
> >>>> +
> >>>> + rdev = conf->mirrors[disk].rdev;
> >>>> + if (!rdev || test_bit(Faulty, &rdev->flags))
> >>>> + continue;
> >>>> +
> >>>> + /* choose the first disk even if it has some bad blocks. */
> >>>> + read_len = raid1_check_read_range(rdev, this_sector, &len);
> >>>> + if (read_len > 0) {
> >>>> + update_read_sectors(conf, disk, this_sector, read_len);
> >>>> + *max_sectors = read_len;
> >>>> + return disk;
> >>>> + }
> >>>
> >>> Hi Kuai
> >>>
> >>> It needs to update max_sectors even if the bad block starts before
> >>> this_sector, because it can't read more than the bad-block length from
> >>> other member disks. If it reads more data than the bad blocks cover, it
> >>> will cause data corruption. One rule here is: read from the primary disk
> >>> (the first readable disk) if it has no bad blocks, and read the
> >>> bad-block-length data from other disks.
> >>
> >> Note that raid1_check_read_range() returns the readable length from
> >> this rdev; hence if a bad block starts before this_sector, 0 is
> >> returned and 'len' is updated to the length of the bad blocks (if it
> >> doesn't exceed the read range), and the following iteration will find
> >> the first disk to read the updated 'len' data and update max_sectors.
> >
> > Hi Kuai
> >
> > The problem is that choose_first_rdev() doesn't return 'len' via
> > max_sectors when bad blocks start before this_sector. In the following
> > iteration, it can't read more than 'len' from other disks, to avoid
> > data corruption. I haven't read all the patches. As for this patch, it
> > resets best_good_sectors to sectors when it encounters a good member
> > disk without bad blocks.
>
> In this case, 'len' is not supposed to be returned; the caller will
> split the original IO based on 'max_sectors', for example:
>
> IO: 2, 4 | ----
> rdev1: BB: 0, 4 |xxxx
> rdev2: no BB
>
> Then choose_first_rdev() will set max_sectors to 2 and return rdev2,
> then the caller will split and issue a new IO:
>
> original IO: 4, 2 | --
> split IO: 2, 2 | --
>
> Finally, the split IO is issued to rdev2. Later the original IO will be
> handled by read_balance() again, and rdev1 will be returned.
>
> Is this case what you concerned?
Ah, I was still thinking in terms of the original logic and forgot that
choose_first_rdev() iterates all disks.
The case I wanted to discuss is:
bad block range: 1------8
first readable disk: 4-----------16
The IO starts at 4, but the bad block starts at 1 with length 8.
raid1_check_read_range() returns 0 and len is updated to 5. In the loop,
the IO can then be split as expected. Thanks for the explanation.
Best Regards
Xiao
>
> Thanks,
> Kuai
>
> >
> > Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>
> >>>
> >>> Best Regards
> >>> Xiao
> >>>
> >>>> + }
> >>>> +
> >>>> + return -1;
> >>>> +}
> >>>> +
> >>>> /*
> >>>> * This routine returns the disk from which the requested read should
> >>>> * be done. There is a per-array 'next expected sequential IO' sector
> >>>> @@ -603,7 +644,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> sector_t best_dist;
> >>>> unsigned int min_pending;
> >>>> struct md_rdev *rdev;
> >>>> - int choose_first;
> >>>>
> >>>> retry:
> >>>> sectors = r1_bio->sectors;
> >>>> @@ -613,10 +653,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> best_pending_disk = -1;
> >>>> min_pending = UINT_MAX;
> >>>> best_good_sectors = 0;
> >>>> - choose_first = raid1_should_read_first(conf->mddev, this_sector,
> >>>> - sectors);
> >>>> clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>>>
> >>>> + if (raid1_should_read_first(conf->mddev, this_sector, sectors))
> >>>> + return choose_first_rdev(conf, r1_bio, max_sectors);
> >>>> +
> >>>> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> >>>> sector_t dist;
> >>>> sector_t first_bad;
> >>>> @@ -662,8 +703,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> * bad_sectors from another device.
> >>>> */
> >>>> bad_sectors -= (this_sector - first_bad);
> >>>> - if (choose_first && sectors > bad_sectors)
> >>>> - sectors = bad_sectors;
> >>>> if (best_good_sectors > sectors)
> >>>> best_good_sectors = sectors;
> >>>>
> >>>> @@ -673,8 +712,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> best_good_sectors = good_sectors;
> >>>> best_disk = disk;
> >>>> }
> >>>> - if (choose_first)
> >>>> - break;
> >>>> }
> >>>> continue;
> >>>> } else {
> >>>> @@ -689,10 +726,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>
> >>>> pending = atomic_read(&rdev->nr_pending);
> >>>> dist = abs(this_sector - conf->mirrors[disk].head_position);
> >>>> - if (choose_first) {
> >>>> - best_disk = disk;
> >>>> - break;
> >>>> - }
> >>>> /* Don't change to another disk for sequential reads */
> >>>> if (conf->mirrors[disk].next_seq_sect == this_sector
> >>>> || dist == 0) {
> >>>> @@ -760,13 +793,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>> rdev = conf->mirrors[best_disk].rdev;
> >>>> if (!rdev)
> >>>> goto retry;
> >>>> - atomic_inc(&rdev->nr_pending);
> >>>> - sectors = best_good_sectors;
> >>>> -
> >>>> - if (conf->mirrors[best_disk].next_seq_sect != this_sector)
> >>>> - conf->mirrors[best_disk].seq_start = this_sector;
> >>>>
> >>>> - conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
> >>>> + sectors = best_good_sectors;
> >>>> + update_read_sectors(conf, disk, this_sector, sectors);
> >>>> }
> >>>> *max_sectors = sectors;
> >>>>
> >>>> --
> >>>> 2.39.2
> >>>>
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>
On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> read_balance() is hard to understand because there are too many status
> and branches, and it's overlong.
>
> This patch factor out the case to read the first rdev from
> read_balance(), there are no functional changes.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 63 +++++++++++++++++++++++++++++++++-------------
> 1 file changed, 46 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 8089c569e84f..08c45ca55a7e 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -579,6 +579,47 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector,
> return len;
> }
>
> +static void update_read_sectors(struct r1conf *conf, int disk,
> + sector_t this_sector, int len)
> +{
> + struct raid1_info *info = &conf->mirrors[disk];
> +
> + atomic_inc(&info->rdev->nr_pending);
> + if (info->next_seq_sect != this_sector)
> + info->seq_start = this_sector;
> + info->next_seq_sect = this_sector + len;
> +}
> +
> +static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> + int *max_sectors)
> +{
> + sector_t this_sector = r1_bio->sector;
> + int len = r1_bio->sectors;
> + int disk;
> +
> + for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> + struct md_rdev *rdev;
> + int read_len;
> +
> + if (r1_bio->bios[disk] == IO_BLOCKED)
> + continue;
> +
> + rdev = conf->mirrors[disk].rdev;
> + if (!rdev || test_bit(Faulty, &rdev->flags))
> + continue;
> +
> + /* choose the first disk even if it has some bad blocks. */
> + read_len = raid1_check_read_range(rdev, this_sector, &len);
> + if (read_len > 0) {
> + update_read_sectors(conf, disk, this_sector, read_len);
> + *max_sectors = read_len;
> + return disk;
> + }
> + }
> +
> + return -1;
> +}
> +
> /*
> * This routine returns the disk from which the requested read should
> * be done. There is a per-array 'next expected sequential IO' sector
> @@ -603,7 +644,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> sector_t best_dist;
> unsigned int min_pending;
> struct md_rdev *rdev;
> - int choose_first;
>
> retry:
> sectors = r1_bio->sectors;
> @@ -613,10 +653,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_pending_disk = -1;
> min_pending = UINT_MAX;
> best_good_sectors = 0;
> - choose_first = raid1_should_read_first(conf->mddev, this_sector,
> - sectors);
> clear_bit(R1BIO_FailFast, &r1_bio->state);
>
> + if (raid1_should_read_first(conf->mddev, this_sector, sectors))
> + return choose_first_rdev(conf, r1_bio, max_sectors);
> +
> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> sector_t dist;
> sector_t first_bad;
> @@ -662,8 +703,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> * bad_sectors from another device..
> */
> bad_sectors -= (this_sector - first_bad);
> - if (choose_first && sectors > bad_sectors)
> - sectors = bad_sectors;
> if (best_good_sectors > sectors)
> best_good_sectors = sectors;
>
> @@ -673,8 +712,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> best_good_sectors = good_sectors;
> best_disk = disk;
> }
> - if (choose_first)
> - break;
> }
> continue;
> } else {
> @@ -689,10 +726,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>
> pending = atomic_read(&rdev->nr_pending);
> dist = abs(this_sector - conf->mirrors[disk].head_position);
> - if (choose_first) {
> - best_disk = disk;
> - break;
> - }
> /* Don't change to another disk for sequential reads */
> if (conf->mirrors[disk].next_seq_sect == this_sector
> || dist == 0) {
> @@ -760,13 +793,9 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> rdev = conf->mirrors[best_disk].rdev;
> if (!rdev)
> goto retry;
> - atomic_inc(&rdev->nr_pending);
> - sectors = best_good_sectors;
> -
> - if (conf->mirrors[best_disk].next_seq_sect != this_sector)
> - conf->mirrors[best_disk].seq_start = this_sector;
>
> - conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
> + sectors = best_good_sectors;
> + update_read_sectors(conf, disk, this_sector, sectors);
> }
> *max_sectors = sectors;
>
> --
> 2.39.2
>
>
Hi
This patch looks good to me.
Reviewed-by: Xiao Ni <[email protected]>
On Thu, Feb 22, 2024 at 4:06 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> The way that best rdev is chosen:
>
> 1) If the read is sequential from one rdev:
> - if rdev is rotational, use this rdev;
> - if rdev is non-rotational, use this rdev until total read length
> exceed disk opt io size;
>
> 2) If the read is not sequential:
> - if there is idle disk, use it, otherwise:
> - if the array has non-rotational disk, choose the rdev with minimal
> inflight IO;
> - if all the underlying disks are rotational, choose the rdev
> with closest IO;
>
> There are no functional changes, just to make code cleaner and prepare
> for following refactor.
>
> Co-developed-by: Paul Luse <[email protected]>
> Signed-off-by: Paul Luse <[email protected]>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1.c | 171 ++++++++++++++++++++++++---------------------
> 1 file changed, 92 insertions(+), 79 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 223ef8d06f67..938b0e0170df 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -730,73 +730,68 @@ static bool should_choose_next(struct r1conf *conf, int disk)
> mirror->next_seq_sect - opt_iosize >= mirror->seq_start;
> }
>
> -/*
> - * This routine returns the disk from which the requested read should
> - * be done. There is a per-array 'next expected sequential IO' sector
> - * number - if this matches on the next IO then we use the last disk.
> - * There is also a per-disk 'last know head position' sector that is
> - * maintained from IRQ contexts, both the normal and the resync IO
> - * completion handlers update this position correctly. If there is no
> - * perfect sequential match then we pick the disk whose head is closest.
> - *
> - * If there are 2 mirrors in the same 2 devices, performance degrades
> - * because position is mirror, not device based.
> - *
> - * The rdev for the device selected will have nr_pending incremented.
> - */
> -static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors)
> +static bool rdev_readable(struct md_rdev *rdev, struct r1bio *r1_bio)
> {
> - const sector_t this_sector = r1_bio->sector;
> - int sectors;
> - int best_good_sectors;
> - int best_disk, best_dist_disk, best_pending_disk;
> - int disk;
> - sector_t best_dist;
> - unsigned int min_pending;
> - struct md_rdev *rdev;
> + if (!rdev || test_bit(Faulty, &rdev->flags))
> + return false;
>
> - retry:
> - sectors = r1_bio->sectors;
> - best_disk = -1;
> - best_dist_disk = -1;
> - best_dist = MaxSector;
> - best_pending_disk = -1;
> - min_pending = UINT_MAX;
> - best_good_sectors = 0;
> - clear_bit(R1BIO_FailFast, &r1_bio->state);
> + /* still in recovery */
> + if (!test_bit(In_sync, &rdev->flags) &&
> + rdev->recovery_offset < r1_bio->sector + r1_bio->sectors)
> + return false;
>
> - if (raid1_should_read_first(conf->mddev, this_sector, sectors))
> - return choose_first_rdev(conf, r1_bio, max_sectors);
> + /* don't read from slow disk unless have to */
> + if (test_bit(WriteMostly, &rdev->flags))
> + return false;
> +
> + /* don't split IO for bad blocks unless have to */
> + if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors))
> + return false;
> +
> + return true;
> +}
> +
> +struct read_balance_ctl {
> + sector_t closest_dist;
> + int closest_dist_disk;
> + int min_pending;
> + int min_pending_disk;
> + int readable_disks;
> +};
> +
> +static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
> +{
> + int disk;
> + struct read_balance_ctl ctl = {
> + .closest_dist_disk = -1,
> + .closest_dist = MaxSector,
> + .min_pending_disk = -1,
> + .min_pending = UINT_MAX,
> + };
>
> for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
> + struct md_rdev *rdev;
> sector_t dist;
> unsigned int pending;
>
> - rdev = conf->mirrors[disk].rdev;
> - if (r1_bio->bios[disk] == IO_BLOCKED
> - || rdev == NULL
> - || test_bit(Faulty, &rdev->flags))
> - continue;
> - if (!test_bit(In_sync, &rdev->flags) &&
> - rdev->recovery_offset < this_sector + sectors)
> - continue;
> - if (test_bit(WriteMostly, &rdev->flags))
> + if (r1_bio->bios[disk] == IO_BLOCKED)
> continue;
> - if (rdev_has_badblock(rdev, this_sector, sectors))
> +
> + rdev = conf->mirrors[disk].rdev;
> + if (!rdev_readable(rdev, r1_bio))
> continue;
>
> - if (best_disk >= 0)
> - /* At least two disks to choose from so failfast is OK */
> + /* At least two disks to choose from so failfast is OK */
> + if (ctl.readable_disks++ == 1)
> set_bit(R1BIO_FailFast, &r1_bio->state);
>
> pending = atomic_read(&rdev->nr_pending);
> - dist = abs(this_sector - conf->mirrors[disk].head_position);
> + dist = abs(r1_bio->sector - conf->mirrors[disk].head_position);
> +
> /* Don't change to another disk for sequential reads */
> if (is_sequential(conf, disk, r1_bio)) {
> - if (!should_choose_next(conf, disk)) {
> - best_disk = disk;
> - break;
> - }
> + if (!should_choose_next(conf, disk))
> + return disk;
>
> /*
> * Add 'pending' to avoid choosing this disk if there is
> @@ -810,42 +805,60 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> dist = 0;
> }
>
> - if (min_pending > pending) {
> - min_pending = pending;
> - best_pending_disk = disk;
> + if (ctl.min_pending > pending) {
> + ctl.min_pending = pending;
> + ctl.min_pending_disk = disk;
> }
>
> - if (dist < best_dist) {
> - best_dist = dist;
> - best_dist_disk = disk;
> + if (dist < ctl.closest_dist) {
> + ctl.closest_dist = dist;
> + ctl.closest_dist_disk = disk;
> }
> }
>
> - /*
> - * If all disks are rotational, choose the closest disk. If any disk is
> - * non-rotational, choose the disk with less pending request even the
> - * disk is rotational, which might/might not be optimal for raids with
> - * mixed ratation/non-rotational disks depending on workload.
> - */
> - if (best_disk == -1) {
> - if (conf->mddev->nonrot_disks || min_pending == 0)
> - best_disk = best_pending_disk;
> - else
> - best_disk = best_dist_disk;
> - }
>
> - if (best_disk >= 0) {
> - rdev = conf->mirrors[best_disk].rdev;
> - if (!rdev)
> - goto retry;
> + if (ctl.min_pending_disk != -1 &&
> + (conf->mddev->nonrot_disks || ctl.min_pending == 0))
> + return ctl.min_pending_disk;
> + else
> + return ctl.closest_dist_disk;
> +}
>
> - sectors = best_good_sectors;
> - update_read_sectors(conf, disk, this_sector, sectors);
> - }
> - *max_sectors = sectors;
> +/*
> + * This routine returns the disk from which the requested read should be done.
> + *
> + * 1) If resync is in progress, find the first usable disk and use
> + * it even if it has some bad blocks.
> + *
> + * 2) Now that there is no resync, loop through all disks and skipping slow
> + * disks and disks with bad blocks for now. Only pay attention to key disk
> + * choice.
> + *
> + * 3) If we've made it this far, now look for disks with bad blocks and choose
> + * the one with most number of sectors.
> + *
> + * 4) If we are all the way at the end, we have no choice but to use a disk even
> + * if it is write mostly.
> +
> + * The rdev for the device selected will have nr_pending incremented.
> + */
> +static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors)
> +{
> + int disk;
> +
> + clear_bit(R1BIO_FailFast, &r1_bio->state);
>
> - if (best_disk >= 0)
> - return best_disk;
> + if (raid1_should_read_first(conf->mddev, r1_bio->sector,
> + r1_bio->sectors))
> + return choose_first_rdev(conf, r1_bio, max_sectors);
> +
> + disk = choose_best_rdev(conf, r1_bio);
> + if (disk >= 0) {
> + *max_sectors = r1_bio->sectors;
> + update_read_sectors(conf, disk, r1_bio->sector,
> + r1_bio->sectors);
> + return disk;
> + }
>
> /*
> * If we are here it means we didn't find a perfectly good disk so
> --
> 2.39.2
>
>
Hi
This patch looks good to me. Thanks very much for the effort. Now
read_balance() is much easier to read and understand.
Reviewed-by: Xiao Ni <[email protected]>
On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2024/02/27 10:23, Xiao Ni 写道:
> > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
> >>
> >> From: Yu Kuai <[email protected]>
> >>
> >> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >> the case choose next idle in read_balance():
> >>
> >> read_balance:
> >> for_each_rdev
> >> if(next_seq_sect == this_sector || disk == 0)
> >> -> sequential reads
> >> best_disk = disk;
> >> if (...)
> >> choose_next_idle = 1
> >> continue;
> >>
> >> for_each_rdev
> >> -> iterate next rdev
> >> if (pending == 0)
> >> best_disk = disk;
> >> -> choose the next idle disk
> >> break;
> >>
> >> if (choose_next_idle)
> >> -> keep using this rdev if there are no other idle disk
> >> contine
> >>
> >> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> remove the code:
> >>
> >> - /* If device is idle, use it */
> >> - if (pending == 0) {
> >> - best_disk = disk;
> >> - break;
> >> - }
> >>
> >> Hence choose next idle will never work now, fix this problem by
> >> following:
> >>
> >> 1) don't set best_disk in this case, read_balance() will choose the best
> >> disk after iterating all the disks;
> >> 2) add 'pending' so that other idle disk will be chosen;
> >> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >> are rotational, this disk will still be chosen;
> >>
> >> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> Co-developed-by: Paul Luse <[email protected]>
> >> Signed-off-by: Paul Luse <[email protected]>
> >> Signed-off-by: Yu Kuai <[email protected]>
> >> ---
> >> drivers/md/raid1.c | 21 ++++++++++++---------
> >> 1 file changed, 12 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c60ea58ae8c5..d0bc67e6d068 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> unsigned int min_pending;
> >> struct md_rdev *rdev;
> >> int choose_first;
> >> - int choose_next_idle;
> >>
> >> /*
> >> * Check if we can balance. We can balance on the whole
> >> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> best_pending_disk = -1;
> >> min_pending = UINT_MAX;
> >> best_good_sectors = 0;
> >> - choose_next_idle = 0;
> >> clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>
> >> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >> struct raid1_info *mirror = &conf->mirrors[disk];
> >>
> >> - best_disk = disk;
> >> /*
> >> * If buffered sequential IO size exceeds optimal
> >> * iosize, check if there is idle disk. If yes, choose
> >> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >> mirror->next_seq_sect > opt_iosize &&
> >> mirror->next_seq_sect - opt_iosize >=
> >> mirror->seq_start) {
> >> - choose_next_idle = 1;
> >> - continue;
> >> + /*
> >> + * Add 'pending' to avoid choosing this disk if
> >> + * there is other idle disk.
> >> + * Set 'dist' to 0, so that if there is no other
> >> + * idle disk and all disks are rotational, this
> >> + * disk will still be chosen.
> >> + */
> >> + pending++;
> >> + dist = 0;
> >> + } else {
> >> + best_disk = disk;
> >> + break;
> >> }
> >> - break;
> >> }
> >
> > Hi Kuai
> >
> > I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
> > a sequential read. If there are no other idle disks, it will read from
> > the sequential disk. With this patch, it reads from the
> > best_pending_disk even if min_pending is not 0. That looks like wrong
> > behaviour?
>
> Yes, nice catch, I didn't notice this yet... So there is a hidden
> logic here: sequential IO has higher priority than the minimal-'pending'
> selection, and it only yields to 'choose_next_idle' when an idle disk
> exists.
Yes.
>
> Looks like if we want to keep this behaviour, we can add a 'sequential
> disk':
>
> if (is_sequential()) {
> 	if (!should_choose_next())
> 		return disk;
> 	ctl.sequential_disk = disk;
> }
>
> ...
>
> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
> 	return ctl.sequential_disk;
Agree with this, thanks :)
Best Regards
Xiao
>
> Thanks,
> Kuai
>
> >
> > Best Regards
> > Xiao
> >>
> >> - if (choose_next_idle)
> >> - continue;
> >> -
> >> if (min_pending > pending) {
> >> min_pending = pending;
> >> best_pending_disk = disk;
> >> --
> >> 2.39.2
> >>
> >>
> >
> > .
> >
>
> On Feb 26, 2024, at 9:49 PM, Xiao Ni <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> 在 2024/02/27 10:23, Xiao Ni 写道:
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <[email protected]> wrote:
>>>>
>>>> From: Yu Kuai <[email protected]>
>>>>
>>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
>>>> the case choose next idle in read_balance():
>>>>
>>>> read_balance:
>>>> for_each_rdev
>>>> if(next_seq_sect == this_sector || disk == 0)
>>>> -> sequential reads
>>>> best_disk = disk;
>>>> if (...)
>>>> choose_next_idle = 1
>>>> continue;
>>>>
>>>> for_each_rdev
>>>> -> iterate next rdev
>>>> if (pending == 0)
>>>> best_disk = disk;
>>>> -> choose the next idle disk
>>>> break;
>>>>
>>>> if (choose_next_idle)
>>>> -> keep using this rdev if there are no other idle disk
>>>> contine
>>>>
>>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> remove the code:
>>>>
>>>> - /* If device is idle, use it */
>>>> - if (pending == 0) {
>>>> - best_disk = disk;
>>>> - break;
>>>> - }
>>>>
>>>> Hence choose next idle will never work now, fix this problem by
>>>> following:
>>>>
>>>> 1) don't set best_disk in this case, read_balance() will choose the best
>>>> disk after iterating all the disks;
>>>> 2) add 'pending' so that other idle disk will be chosen;
>>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>>> are rotational, this disk will still be chosen;
>>>>
>>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> Co-developed-by: Paul Luse <[email protected]>
>>>> Signed-off-by: Paul Luse <[email protected]>
>>>> Signed-off-by: Yu Kuai <[email protected]>
>>>> ---
>>>> drivers/md/raid1.c | 21 ++++++++++++---------
>>>> 1 file changed, 12 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index c60ea58ae8c5..d0bc67e6d068 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> unsigned int min_pending;
>>>> struct md_rdev *rdev;
>>>> int choose_first;
>>>> - int choose_next_idle;
>>>>
>>>> /*
>>>> * Check if we can balance. We can balance on the whole
>>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> best_pending_disk = -1;
>>>> min_pending = UINT_MAX;
>>>> best_good_sectors = 0;
>>>> - choose_next_idle = 0;
>>>> clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>>
>>>> if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>>> struct raid1_info *mirror = &conf->mirrors[disk];
>>>>
>>>> - best_disk = disk;
>>>> /*
>>>> * If buffered sequential IO size exceeds optimal
>>>> * iosize, check if there is idle disk. If yes, choose
>>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>> mirror->next_seq_sect > opt_iosize &&
>>>> mirror->next_seq_sect - opt_iosize >=
>>>> mirror->seq_start) {
>>>> - choose_next_idle = 1;
>>>> - continue;
>>>> + /*
>>>> + * Add 'pending' to avoid choosing this disk if
>>>> + * there is other idle disk.
>>>> + * Set 'dist' to 0, so that if there is no other
>>>> + * idle disk and all disks are rotational, this
>>>> + * disk will still be chosen.
>>>> + */
>>>> + pending++;
>>>> + dist = 0;
>>>> + } else {
>>>> + best_disk = disk;
>>>> + break;
>>>> }
>>>> - break;
>>>> }
>>>
>>> Hi Kuai
>>>
>>> I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
>>> a sequential read. If there are no other idle disks, it will read from
>>> the sequential disk. With this patch, it reads from the
>>> best_pending_disk even if min_pending is not 0. That looks like wrong
>>> behaviour?
>>
>> Yes, nice catch, I didn't notice this yet... So there is a hidden
>> logic here: sequential IO has higher priority than the minimal-'pending'
>> selection, and it only yields to 'choose_next_idle' when an idle disk
>> exists.
>
> Yes.
>
>
>>
>> Looks like if we want to keep this behaviour, we can add a 'sequential
>> disk':
>>
>> if (is_sequential()) {
>> 	if (!should_choose_next())
>> 		return disk;
>> 	ctl.sequential_disk = disk;
>> }
>>
>> ...
>>
>> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>> 	return ctl.sequential_disk;
>
> Agree with this, thanks :)
>
> Best Regards
> Xiao
Yup, agree as well. This will help for sure with the follow-up to this series for sequential read improvements :)
>>
>> Thanks,
>> Kuai
>>
>>>
>>> Best Regards
>>> Xiao
>>>>
>>>> - if (choose_next_idle)
>>>> - continue;
>>>> -
>>>> if (min_pending > pending) {
>>>> min_pending = pending;
>>>> best_pending_disk = disk;
>>>> --
>>>> 2.39.2
>>>>
>>>>
>>>
>>> .
>>>
>>
>
>