2016-11-18 05:16:56

by NeilBrown

Subject: [PATCH/RFC] add "failfast" support for raid1/raid10.

Hi,

I've been sitting on these patches for a while because although they
solve a real problem, it is a fairly limited use-case, and I don't
really like some of the details.

So I'm posting them as RFC in the hope that a different perspective
might help me like them better, or find a better approach.

The core idea is that when you have multiple copies of data
(i.e. mirrored drives) it doesn't make sense to wait for a read from
a drive that seems to be having problems. It will probably be faster
to just cancel that read, and read from the other device.
Similarly, in some circumstances, it might be better to fail a drive
that is being slow to respond to writes, rather than cause all writes
to be very slow.

The particular context where this comes up is when mirroring across
storage arrays, where the storage arrays can temporarily take an
unusually long time to respond to requests (firmware updates have
been mentioned). As the array will have redundancy internally, there
is little risk to the data. The mirrored pair is really only for
disaster recovery, and it is deemed better to lose the last few
minutes of updates in the case of a serious disaster, rather than
occasionally having latency issues because one array needs to do some
maintenance for a few minutes. The particular storage arrays in
question are DASD devices which are part of the s390 ecosystem.

Linux block layer has "failfast" flags to direct drivers to fail more
quickly. These patches allow devices in an md array to be given a
"failfast" flag, which will cause IO requests to be marked as
"failfast" providing there is another device available. Once the
array becomes degraded, we stop using failfast, as that could result
in data loss.

I don't like the whole "failfast" concept because it is not at all
clear how fast "fast" is. In fact, these block-layer flags are
really a misnomer. They should be "noretry" flags.
REQ_FAILFAST_DEV means "don't retry requests which reported an error
which seems to come from the device".
REQ_FAILFAST_TRANSPORT means "don't retry requests which seem to
indicate a problem with the transport, rather than the device"
REQ_FAILFAST_DRIVER means .... I'm not exactly sure. I think it
means whatever a particular driver wants it to mean, basically "I
cannot seem to handle this right now, just resend and I'll probably
be more in control next time". It seems to be for internal-use only.

Multipath code uses REQ_FAILFAST_TRANSPORT only, which makes sense.
btrfs uses REQ_FAILFAST_DEV only (for read-ahead) which doesn't seem
to make sense.... why would you ever use _DEV without _TRANSPORT?

None of these actually change the timeouts in the driver or in the
device, which is what I would expect for "failfast", so to get real
"fast failure" you need to enable failfast, and adjust the timeouts.
That is what we do for our customers with DASD.

Anyway, it seems to make sense to use _TRANSPORT and _DEV for
requests from md where there is somewhere to fall back on.
If we get an error from a "failfast" request, and the array is still
non-degraded, we just fail the device. We don't try to repair read
errors (which is pointless on storage arrays).
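
As a rough illustration of what that amounts to at the bio level, a
read submitted while a fallback exists looks something like the sketch
below. This is not code from the patches; 'have_other_copy' stands in
for the real md checks, and the patches wrap the same pair of flags as
MD_FAILFAST in md.h.

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Submit a mirror read, asking the lower layers not to retry it when
 * another copy of the data exists.  Timeouts are unchanged; only the
 * retry behaviour differs. */
static void submit_mirror_read(struct bio *bio, bool have_other_copy)
{
	bio_set_op_attrs(bio, REQ_OP_READ, 0);
	if (have_other_copy)
		bio->bi_opf |= REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT;
	generic_make_request(bio);
}

If such a read then fails while the array is still non-degraded, the
raid1/raid10 patches below call md_error() on that device rather than
attempting any repair.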

It is assumed that some user-space code will notice the failure,
monitor the device to see when it becomes available again, and then
--re-add it. Assuming the array has a bitmap, the --re-add should be
fast and the array will become optimal again without experiencing
excessive latencies.

My two main concerns are:
- does this functionality have any use-case outside of mirrored
storage arrays, and are there other storage arrays which
occasionally insert excessive latency (seems like a serious
misfeature to me, but I know few of the details)?
- would it be at all possible to have "real" failfast functionality
in the block layer? I.e. something that is based on time rather
than retry count. Maybe in some cases a retry would be
appropriate if the first failure was very fast.
I.e. it would reduce timeouts and decide on retries based on
elapsed time rather than number of attempts.
With this would come the question of "how fast is fast" and I
don't have a really good answer. Maybe md would need to set a
timeout, which it would double whenever it got failures on all
drives. Otherwise the timeout would drift towards (say) 10 times
the typical response time.
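
As a very rough sketch of that last idea (nothing like this exists in
the block layer or in these patches; the names and the factor of 10
are just lifted from the paragraph above):

struct adaptive_timeout {
	long typical_ms;	/* smoothed typical response time */
	long timeout_ms;	/* current fail-fast deadline */
};

/* On each successful request, update the typical latency and let the
 * deadline drift towards 10x that value. */
static void at_request_succeeded(struct adaptive_timeout *at, long latency_ms)
{
	at->typical_ms = (7 * at->typical_ms + latency_ms) / 8;
	at->timeout_ms += (10 * at->typical_ms - at->timeout_ms) / 8;
}

/* Failures on every copy suggest the deadline was too tight. */
static void at_all_copies_failed(struct adaptive_timeout *at)
{
	at->timeout_ms *= 2;
}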

So: comments most welcome. As I say, this does address a genuine
need. I just find it hard to like it :-(


Thanks,
NeilBrown

---

NeilBrown (6):
md/failfast: add failfast flag for md to be used by some personalities.
md: Use REQ_FAILFAST_* on metadata writes where appropriate
md/raid1: add failfast handling for reads.
md/raid1: add failfast handling for writes.
md/raid10: add failfast handling for reads.
md/raid10: add failfast handling for writes.


drivers/md/bitmap.c | 15 ++++++--
drivers/md/md.c | 71 +++++++++++++++++++++++++++++++-----
drivers/md/md.h | 27 +++++++++++++-
drivers/md/raid1.c | 79 ++++++++++++++++++++++++++++++++++------
drivers/md/raid1.h | 1 +
drivers/md/raid10.c | 79 +++++++++++++++++++++++++++++++++++++---
drivers/md/raid10.h | 2 +
include/uapi/linux/raid/md_p.h | 7 +++-
8 files changed, 249 insertions(+), 32 deletions(-)

--
Signature


2016-11-18 05:17:04

by NeilBrown

Subject: [md PATCH 1/6] md/failfast: add failfast flag for md to be used by some personalities.

This patch just adds a 'failfast' per-device flag which can be stored
in v0.90 or v1.x metadata.
The flag is not used yet but the intent is that it can be used for
mirrored (raid1/raid10) arrays where low latency is more important
than keeping all devices on-line.

Setting the flag for a device effectively gives permission for that
device to be marked as Faulty and excluded from the array on the first
error. The underlying driver will be directed not to retry requests
that result in failures. There is a proviso that the device must not
be marked Faulty if that would cause the array as a whole to fail; it
may only be marked Faulty if the array remains functional, but is
degraded.

Failures on read requests will cause the device to be marked
as Faulty immediately so that further reads will avoid that
device. No attempt will be made to correct read errors by
over-writing with the correct data.

It is expected that if transient errors, such as cable unplug, are
possible, then something in user-space will revalidate failed
devices and re-add them when they appear to be working again.

Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/md.c | 27 +++++++++++++++++++++++++++
drivers/md/md.h | 6 ++++++
include/uapi/linux/raid/md_p.h | 7 ++++++-
3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f1c7f007b68..0d1a9fd9d1c1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1163,6 +1163,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
}
if (desc->state & (1<<MD_DISK_WRITEMOSTLY))
set_bit(WriteMostly, &rdev->flags);
+ if (desc->state & (1<<MD_DISK_FAILFAST))
+ set_bit(FailFast, &rdev->flags);
} else /* MULTIPATH are always insync */
set_bit(In_sync, &rdev->flags);
return 0;
@@ -1288,6 +1290,8 @@ static void super_90_sync(struct mddev *mddev, struct md_rdev *rdev)
}
if (test_bit(WriteMostly, &rdev2->flags))
d->state |= (1<<MD_DISK_WRITEMOSTLY);
+ if (test_bit(FailFast, &rdev2->flags))
+ d->state |= (1<<MD_DISK_FAILFAST);
}
/* now set the "removed" and "faulty" bits on any missing devices */
for (i=0 ; i < mddev->raid_disks ; i++) {
@@ -1672,6 +1676,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
}
if (sb->devflags & WriteMostly1)
set_bit(WriteMostly, &rdev->flags);
+ if (sb->devflags & FailFast1)
+ set_bit(FailFast, &rdev->flags);
if (le32_to_cpu(sb->feature_map) & MD_FEATURE_REPLACEMENT)
set_bit(Replacement, &rdev->flags);
} else /* MULTIPATH are always insync */
@@ -1710,6 +1716,10 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
sb->chunksize = cpu_to_le32(mddev->chunk_sectors);
sb->level = cpu_to_le32(mddev->level);
sb->layout = cpu_to_le32(mddev->layout);
+ if (test_bit(FailFast, &rdev->flags))
+ sb->devflags |= FailFast1;
+ else
+ sb->devflags &= ~FailFast1;

if (test_bit(WriteMostly, &rdev->flags))
sb->devflags |= WriteMostly1;
@@ -2554,6 +2564,8 @@ state_show(struct md_rdev *rdev, char *page)
len += sprintf(page+len, "replacement%s", sep);
if (test_bit(ExternalBbl, &flags))
len += sprintf(page+len, "external_bbl%s", sep);
+ if (test_bit(FailFast, &flags))
+ len += sprintf(page+len, "failfast%s", sep);

if (len)
len -= strlen(sep);
@@ -2576,6 +2588,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
* so that it gets rebuilt based on bitmap
* write_error - sets WriteErrorSeen
* -write_error - clears WriteErrorSeen
+ * {,-}failfast - set/clear FailFast
*/
int err = -EINVAL;
if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
@@ -2634,6 +2647,12 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
set_bit(In_sync, &rdev->flags);
err = 0;
+ } else if (cmd_match(buf, "failfast")) {
+ set_bit(FailFast, &rdev->flags);
+ err = 0;
+ } else if (cmd_match(buf, "-failfast")) {
+ clear_bit(FailFast, &rdev->flags);
+ err = 0;
} else if (cmd_match(buf, "-insync") && rdev->raid_disk >= 0 &&
!test_bit(Journal, &rdev->flags)) {
if (rdev->mddev->pers == NULL) {
@@ -5939,6 +5958,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
info.state |= (1<<MD_DISK_JOURNAL);
if (test_bit(WriteMostly, &rdev->flags))
info.state |= (1<<MD_DISK_WRITEMOSTLY);
+ if (test_bit(FailFast, &rdev->flags))
+ info.state |= (1<<MD_DISK_FAILFAST);
} else {
info.major = info.minor = 0;
info.raid_disk = -1;
@@ -6046,6 +6067,10 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)
set_bit(WriteMostly, &rdev->flags);
else
clear_bit(WriteMostly, &rdev->flags);
+ if (info->state & (1<<MD_DISK_FAILFAST))
+ set_bit(FailFast, &rdev->flags);
+ else
+ clear_bit(FailFast, &rdev->flags);

if (info->state & (1<<MD_DISK_JOURNAL)) {
struct md_rdev *rdev2;
@@ -6135,6 +6160,8 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)

if (info->state & (1<<MD_DISK_WRITEMOSTLY))
set_bit(WriteMostly, &rdev->flags);
+ if (info->state & (1<<MD_DISK_FAILFAST))
+ set_bit(FailFast, &rdev->flags);

if (!mddev->persistent) {
pr_debug("md: nonpersistent superblock ...\n");
diff --git a/drivers/md/md.h b/drivers/md/md.h
index af6b33c30d2d..bc6712ef8c81 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -171,6 +171,12 @@ enum flag_bits {
ExternalBbl, /* External metadata provides bad
* block management for a disk
*/
+ FailFast, /* Minimal retries should be attempted on
+ * this device, so use REQ_FAILFAST_DEV.
+ * Also don't try to repair failed reads.
+ * It is expects that no bad block log
+ * is present.
+ */
};

static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index c3e654c6d518..9930f3e9040f 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -84,6 +84,10 @@
#define MD_DISK_CANDIDATE 5 /* disk is added as spare (local) until confirmed
* For clustered enviroments only.
*/
+#define MD_DISK_FAILFAST 10 /* Send REQ_FAILFAST if there are multiple
+ * devices available - and don't try to
+ * correct read errors.
+ */

#define MD_DISK_WRITEMOSTLY 9 /* disk is "write-mostly" is RAID1 config.
* read requests will only be sent here in
@@ -265,8 +269,9 @@ struct mdp_superblock_1 {
__le32 dev_number; /* permanent identifier of this device - not role in raid */
__le32 cnt_corrected_read; /* number of read errors that were corrected by re-writing */
__u8 device_uuid[16]; /* user-space setable, ignored by kernel */
- __u8 devflags; /* per-device flags. Only one defined...*/
+ __u8 devflags; /* per-device flags. Only two defined...*/
#define WriteMostly1 1 /* mask for writemostly flag in above */
+#define FailFast1 2 /* Should avoid retries and fixups and just fail */
/* Bad block log. If there are any bad blocks the feature flag is set.
* If offset and size are non-zero, that space is reserved and available
*/


2016-11-18 05:17:18

by NeilBrown

Subject: [md PATCH 3/6] md/raid1: add failfast handling for reads.

If a device is marked FailFast and it is not the only device
we can read from, we mark the bio with REQ_FAILFAST_* flags.

If this does fail, we don't try read repair but just allow
failure. If it was the last device it doesn't fail of
course, so the retry happens on the same device - this time
without FAILFAST. A subsequent failure will not retry but
will just pass up the error.

During resync we may use FAILFAST requests and on a failure
we will simply use the other device(s).

During recovery we will only use FAILFAST in the unusual
case where there are multiple places to read from - i.e. if
there are > 2 devices. If we get a failure we will fail the
device and complete the resync/recovery with remaining
devices.

The new R1BIO_FailFast flag is set on read requests to suggest
that a FAILFAST request might be acceptable. The rdev needs
to have FailFast set as well for the read to actually use
REQ_FAILFAST_*.

We need to know there are at least two working devices
before we can set R1BIO_FailFast, so we mustn't stop looking
at the first device we find. So we change the "min_pending == 0"
handling to not exit early, but to always choose the
best_pending_disk if min_pending == 0.

The spinlocked region in raid1_error() is enlarged to ensure
that if two bios, reading from two different devices, fail
at the same time, then there is no risk that both devices
will be marked faulty, leaving zero "In_sync" devices.

Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/raid1.c | 52 ++++++++++++++++++++++++++++++++++++++++++----------
drivers/md/raid1.h | 1 +
2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index cfd73730e6af..44f93297698d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -330,6 +330,11 @@ static void raid1_end_read_request(struct bio *bio)

if (uptodate)
set_bit(R1BIO_Uptodate, &r1_bio->state);
+ else if (test_bit(FailFast, &rdev->flags) &&
+ test_bit(R1BIO_FailFast, &r1_bio->state))
+ /* This was a fail-fast read so we definitely
+ * want to retry */
+ ;
else {
/* If all other devices have failed, we want to return
* the error upwards rather than fail the last device.
@@ -536,6 +541,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
best_good_sectors = 0;
has_nonrot_disk = 0;
choose_next_idle = 0;
+ clear_bit(R1BIO_FailFast, &r1_bio->state);

if ((conf->mddev->recovery_cp < this_sector + sectors) ||
(mddev_is_clustered(conf->mddev) &&
@@ -609,6 +615,10 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
} else
best_good_sectors = sectors;

+ if (best_disk >= 0)
+ /* At least two disks to choose from so failfast is OK */
+ set_bit(R1BIO_FailFast, &r1_bio->state);
+
nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
has_nonrot_disk |= nonrot;
pending = atomic_read(&rdev->nr_pending);
@@ -647,11 +657,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
}
break;
}
- /* If device is idle, use it */
- if (pending == 0) {
- best_disk = disk;
- break;
- }

if (choose_next_idle)
continue;
@@ -674,7 +679,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
* mixed ratation/non-rotational disks depending on workload.
*/
if (best_disk == -1) {
- if (has_nonrot_disk)
+ if (has_nonrot_disk || min_pending == 0)
best_disk = best_pending_disk;
else
best_disk = best_dist_disk;
@@ -1168,6 +1173,9 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
read_bio->bi_bdev = mirror->rdev->bdev;
read_bio->bi_end_io = raid1_end_read_request;
bio_set_op_attrs(read_bio, op, do_sync);
+ if (test_bit(FailFast, &mirror->rdev->flags) &&
+ test_bit(R1BIO_FailFast, &r1_bio->state))
+ read_bio->bi_opf |= MD_FAILFAST;
read_bio->bi_private = r1_bio;

if (mddev->gendisk)
@@ -1465,6 +1473,7 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
* next level up know.
* else mark the drive as failed
*/
+ spin_lock_irqsave(&conf->device_lock, flags);
if (test_bit(In_sync, &rdev->flags)
&& (conf->raid_disks - mddev->degraded) == 1) {
/*
@@ -1474,10 +1483,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
* it is very likely to fail.
*/
conf->recovery_disabled = mddev->recovery_disabled;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
return;
}
set_bit(Blocked, &rdev->flags);
- spin_lock_irqsave(&conf->device_lock, flags);
if (test_and_clear_bit(In_sync, &rdev->flags)) {
mddev->degraded++;
set_bit(Faulty, &rdev->flags);
@@ -1816,12 +1825,24 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
sector_t sect = r1_bio->sector;
int sectors = r1_bio->sectors;
int idx = 0;
+ struct md_rdev *rdev;
+
+ rdev = conf->mirrors[r1_bio->read_disk].rdev;
+ if (test_bit(FailFast, &rdev->flags)) {
+ /* Don't try recovering from here - just fail it
+ * ... unless it is the last working device of course */
+ md_error(mddev, rdev);
+ if (test_bit(Faulty, &rdev->flags))
+ /* Don't try to read from here, but make sure
+ * put_buf does it's thing
+ */
+ bio->bi_end_io = end_sync_write;
+ }

while(sectors) {
int s = sectors;
int d = r1_bio->read_disk;
int success = 0;
- struct md_rdev *rdev;
int start;

if (s > (PAGE_SIZE>>9))
@@ -2332,7 +2353,9 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
bio_put(bio);
r1_bio->bios[r1_bio->read_disk] = NULL;

- if (mddev->ro == 0) {
+ rdev = conf->mirrors[r1_bio->read_disk].rdev;
+ if (mddev->ro == 0
+ && !test_bit(FailFast, &rdev->flags)) {
freeze_array(conf, 1);
fix_read_error(conf, r1_bio->read_disk,
r1_bio->sector, r1_bio->sectors);
@@ -2341,7 +2364,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
}

- rdev_dec_pending(conf->mirrors[r1_bio->read_disk].rdev, conf->mddev);
+ rdev_dec_pending(rdev, conf->mddev);

read_more:
disk = read_balance(conf, r1_bio, &max_sectors);
@@ -2366,6 +2389,9 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
bio->bi_bdev = rdev->bdev;
bio->bi_end_io = raid1_end_read_request;
bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
+ if (test_bit(FailFast, &rdev->flags) &&
+ test_bit(R1BIO_FailFast, &r1_bio->state))
+ bio->bi_opf |= MD_FAILFAST;
bio->bi_private = r1_bio;
if (max_sectors < r1_bio->sectors) {
/* Drat - have to split this up more */
@@ -2654,6 +2680,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
bio->bi_iter.bi_sector = sector_nr + rdev->data_offset;
bio->bi_bdev = rdev->bdev;
bio->bi_private = r1_bio;
+ if (test_bit(FailFast, &rdev->flags))
+ bio->bi_opf |= MD_FAILFAST;
}
}
rcu_read_unlock();
@@ -2784,6 +2812,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
if (bio->bi_end_io == end_sync_read) {
read_targets--;
md_sync_acct(bio->bi_bdev, nr_sectors);
+ if (read_targets == 1)
+ bio->bi_opf &= ~MD_FAILFAST;
generic_make_request(bio);
}
}
@@ -2791,6 +2821,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
atomic_set(&r1_bio->remaining, 1);
bio = r1_bio->bios[r1_bio->read_disk];
md_sync_acct(bio->bi_bdev, nr_sectors);
+ if (read_targets == 1)
+ bio->bi_opf &= ~MD_FAILFAST;
generic_make_request(bio);

}
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 5ec19449779d..c52ef424a24b 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -183,5 +183,6 @@ enum r1bio_state {
*/
R1BIO_MadeGood,
R1BIO_WriteError,
+ R1BIO_FailFast,
};
#endif


2016-11-18 05:17:10

by NeilBrown

Subject: [md PATCH 2/6] md: Use REQ_FAILFAST_* on metadata writes where appropriate

This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state. i.e. if marking a device Faulty would cause some
data to be inaccessible, the device's status is left as
non-Faulty. This is true for RAID1 and RAID10.

If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve the chance of success. We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.

Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/bitmap.c | 15 ++++++++++++---
drivers/md/md.c | 44 ++++++++++++++++++++++++++++++++++----------
drivers/md/md.h | 21 ++++++++++++++++++++-
drivers/md/raid1.c | 1 +
drivers/md/raid10.c | 1 +
5 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index cf77cbf9ed22..c4621571b718 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -209,11 +209,13 @@ static struct md_rdev *next_active_rdev(struct md_rdev *rdev, struct mddev *mdde

static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
{
- struct md_rdev *rdev = NULL;
+ struct md_rdev *rdev;
struct block_device *bdev;
struct mddev *mddev = bitmap->mddev;
struct bitmap_storage *store = &bitmap->storage;

+restart:
+ rdev = NULL;
while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
int size = PAGE_SIZE;
loff_t offset = mddev->bitmap_info.offset;
@@ -269,8 +271,8 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
page);
}

- if (wait)
- md_super_wait(mddev);
+ if (wait && md_super_wait(mddev) < 0)
+ goto restart;
return 0;

bad_alignment:
@@ -428,6 +430,13 @@ static void bitmap_wait_writes(struct bitmap *bitmap)
wait_event(bitmap->write_wait,
atomic_read(&bitmap->pending_writes)==0);
else
+ /* Note that we ignore the return value. The writes
+ * might have failed, but that would just mean that
+ * some bits which should be cleared haven't been,
+ * which is safe. The relevant bitmap blocks will
+ * probably get written again, but there is no great
+ * loss if they aren't.
+ */
md_super_wait(bitmap->mddev);
}

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0d1a9fd9d1c1..047e7db1381b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -726,7 +726,13 @@ static void super_written(struct bio *bio)
if (bio->bi_error) {
pr_err("md: super_written gets error=%d\n", bio->bi_error);
md_error(mddev, rdev);
- }
+ if (!test_bit(Faulty, &rdev->flags)
+ && (bio->bi_opf & MD_FAILFAST)) {
+ set_bit(MD_NEED_REWRITE, &mddev->flags);
+ set_bit(LastDev, &rdev->flags);
+ }
+ } else
+ clear_bit(LastDev, &rdev->flags);

if (atomic_dec_and_test(&mddev->pending_writes))
wake_up(&mddev->sb_wait);
@@ -743,7 +749,13 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
* if zero is reached.
* If an error occurred, call md_error
*/
- struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, mddev);
+ struct bio *bio;
+ int ff = 0;
+
+ if (test_bit(Faulty, &rdev->flags))
+ return;
+
+ bio = bio_alloc_mddev(GFP_NOIO, 1, mddev);

atomic_inc(&rdev->nr_pending);

@@ -752,16 +764,24 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
bio_add_page(bio, page, size, 0);
bio->bi_private = rdev;
bio->bi_end_io = super_written;
- bio_set_op_attrs(bio, REQ_OP_WRITE, WRITE_FLUSH_FUA);
+
+ if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
+ test_bit(FailFast, &rdev->flags) &&
+ !test_bit(LastDev, &rdev->flags))
+ ff = MD_FAILFAST;
+ bio_set_op_attrs(bio, REQ_OP_WRITE, WRITE_FLUSH_FUA | ff);

atomic_inc(&mddev->pending_writes);
submit_bio(bio);
}

-void md_super_wait(struct mddev *mddev)
+int md_super_wait(struct mddev *mddev)
{
/* wait for all superblock writes that were scheduled to complete */
wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
+ if (test_and_clear_bit(MD_NEED_REWRITE, &mddev->flags))
+ return -EAGAIN;
+ return 0;
}

int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
@@ -1333,9 +1353,10 @@ super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
if (IS_ENABLED(CONFIG_LBDAF) && (u64)num_sectors >= (2ULL << 32) &&
rdev->mddev->level >= 1)
num_sectors = (sector_t)(2ULL << 32) - 2;
- md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
+ do
+ md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
rdev->sb_page);
- md_super_wait(rdev->mddev);
+ while (md_super_wait(rdev->mddev) < 0);
return num_sectors;
}

@@ -1876,9 +1897,10 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
sb->data_size = cpu_to_le64(num_sectors);
sb->super_offset = rdev->sb_start;
sb->sb_csum = calc_sb_1_csum(sb);
- md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
- rdev->sb_page);
- md_super_wait(rdev->mddev);
+ do
+ md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
+ rdev->sb_page);
+ while (md_super_wait(rdev->mddev) < 0);
return num_sectors;

}
@@ -2413,6 +2435,7 @@ void md_update_sb(struct mddev *mddev, int force_change)
pr_debug("md: updating %s RAID superblock on device (in sync %d)\n",
mdname(mddev), mddev->in_sync);

+rewrite:
bitmap_update_sb(mddev->bitmap);
rdev_for_each(rdev, mddev) {
char b[BDEVNAME_SIZE];
@@ -2444,7 +2467,8 @@ void md_update_sb(struct mddev *mddev, int force_change)
/* only need to write one superblock... */
break;
}
- md_super_wait(mddev);
+ if (md_super_wait(mddev) < 0)
+ goto rewrite;
/* if there was a failure, MD_CHANGE_DEVS was set, and we re-write super */

if (mddev_is_clustered(mddev) && ret == 0)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index bc6712ef8c81..5c08f84101fa 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -30,6 +30,16 @@
#define MaxSector (~(sector_t)0)

/*
+ * These flags should really be called "NO_RETRY" rather than
+ * "FAILFAST" because they don't make any promise about time lapse,
+ * only about the number of retries, which will be zero.
+ * REQ_FAILFAST_DRIVER is not included because
+ * Commit: 4a27446f3e39 ("[SCSI] modify scsi to handle new fail fast flags.")
+ * seems to suggest that the errors it avoids retrying should usually
+ * be retried.
+ */
+#define MD_FAILFAST (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)
+/*
* MD's 'extended' device
*/
struct md_rdev {
@@ -177,6 +187,10 @@ enum flag_bits {
* It is expects that no bad block log
* is present.
*/
+ LastDev, /* Seems to be the last working dev as
+ * it didn't fail, so don't use FailFast
+ * any more for metadata
+ */
};

static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
@@ -213,6 +227,11 @@ enum mddev_flags {
MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
* already took resync lock, need to
* release the lock */
+ MD_FAILFAST_SUPPORTED, /* Using MD_FAILFAST on metadata writes is
+ * supported as calls to md_error() will
+ * never cause the array to become failed.
+ */
+ MD_NEED_REWRITE, /* metadata write needs to be repeated */
};
#define MD_UPDATE_SB_FLAGS (BIT(MD_CHANGE_DEVS) | \
BIT(MD_CHANGE_CLEAN) | \
@@ -628,7 +647,7 @@ extern int mddev_congested(struct mddev *mddev, int bits);
extern void md_flush_request(struct mddev *mddev, struct bio *bio);
extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
sector_t sector, int size, struct page *page);
-extern void md_super_wait(struct mddev *mddev);
+extern int md_super_wait(struct mddev *mddev);
extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
struct page *page, int op, int op_flags,
bool metadata_op);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 96474f0af1b8..cfd73730e6af 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2989,6 +2989,7 @@ static int raid1_run(struct mddev *mddev)
mddev->thread = conf->thread;
conf->thread = NULL;
mddev->private = conf;
+ set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);

md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b53610c59104..763ca45b6b32 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3728,6 +3728,7 @@ static int raid10_run(struct mddev *mddev)
size = raid10_size(mddev, 0, 0);
md_set_array_sectors(mddev, size);
mddev->resync_max_sectors = size;
+ set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);

if (mddev->queue) {
int stripe = conf->geo.raid_disks *


2016-11-18 05:17:26

by NeilBrown

Subject: [md PATCH 4/6] md/raid1: add failfast handling for writes.

When writing to a failfast device we use MD_FAILFAST unless
it is the only device being written to.

For resync/recovery, assume there was a working device to
read from so always use MD_FAILFAST.

If a write for resync/recovery fails, we just fail the
device - there is not much else to do.

If a normal failfast write fails, but the device cannot be
failed (must be only one left), we queue for write error
handling. This will call narrow_write_error() to retry the
write synchronously and without any FAILFAST flags.

Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/raid1.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 44f93297698d..731fd9fe79ef 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -423,7 +423,24 @@ static void raid1_end_write_request(struct bio *bio)
set_bit(MD_RECOVERY_NEEDED, &
conf->mddev->recovery);

- set_bit(R1BIO_WriteError, &r1_bio->state);
+ if (test_bit(FailFast, &rdev->flags) &&
+ (bio->bi_opf & MD_FAILFAST) &&
+ /* We never try FailFast to WriteMostly devices */
+ !test_bit(WriteMostly, &rdev->flags)) {
+ md_error(r1_bio->mddev, rdev);
+ if (!test_bit(Faulty, &rdev->flags))
+ /* This is the only remaining device,
+ * We need to retry the write without
+ * FailFast
+ */
+ set_bit(R1BIO_WriteError, &r1_bio->state);
+ else {
+ /* Finished with this branch */
+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
+ }
+ } else
+ set_bit(R1BIO_WriteError, &r1_bio->state);
} else {
/*
* Set R1BIO_Uptodate in our master bio, so that we
@@ -1393,6 +1410,10 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
mbio->bi_end_io = raid1_end_write_request;
bio_set_op_attrs(mbio, op, do_flush_fua | do_sync);
+ if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) &&
+ !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) &&
+ conf->raid_disks - mddev->degraded > 1)
+ mbio->bi_opf |= MD_FAILFAST;
mbio->bi_private = r1_bio;

atomic_inc(&r1_bio->remaining);
@@ -2061,6 +2082,9 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
continue;

bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
+ if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
+ wbio->bi_opf |= MD_FAILFAST;
+
wbio->bi_end_io = end_sync_write;
atomic_inc(&r1_bio->remaining);
md_sync_acct(conf->mirrors[i].rdev->bdev, bio_sectors(wbio));


2016-11-18 05:17:39

by NeilBrown

Subject: [md PATCH 6/6] md/raid10: add failfast handling for writes.

When writing to a failfast device, we use MD_FAILFAST unless
it is the only device being written to. For
resync/recovery, assume there was a working device to read
from so always use MD_FAILFAST.

If a write for resync/recovery fails, we just fail the
device - there is not much else to do.

If a normal write fails, but the device cannot be marked
Faulty (must be only one left), we queue for write error
handling which calls narrow_write_error() to write the block
synchronously without any failfast flags.

Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/raid10.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 99fa1b980371..c191d00055d0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -100,6 +100,7 @@ static int max_queued_requests = 1024;
static void allow_barrier(struct r10conf *conf);
static void lower_barrier(struct r10conf *conf);
static int _enough(struct r10conf *conf, int previous, int ignore);
+static int enough(struct r10conf *conf, int ignore);
static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
int *skipped);
static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
@@ -451,6 +452,7 @@ static void raid10_end_write_request(struct bio *bio)
struct r10conf *conf = r10_bio->mddev->private;
int slot, repl;
struct md_rdev *rdev = NULL;
+ struct bio *to_put = NULL;
bool discard_error;

discard_error = bio->bi_error && bio_op(bio) == REQ_OP_DISCARD;
@@ -478,8 +480,24 @@ static void raid10_end_write_request(struct bio *bio)
if (!test_and_set_bit(WantReplacement, &rdev->flags))
set_bit(MD_RECOVERY_NEEDED,
&rdev->mddev->recovery);
- set_bit(R10BIO_WriteError, &r10_bio->state);
+
dec_rdev = 0;
+ if (test_bit(FailFast, &rdev->flags) &&
+ (bio->bi_opf & MD_FAILFAST)) {
+ md_error(rdev->mddev, rdev);
+ if (!test_bit(Faulty, &rdev->flags))
+ /* This is the only remaining device,
+ * We need to retry the write without
+ * FailFast
+ */
+ set_bit(R10BIO_WriteError, &r10_bio->state);
+ else {
+ r10_bio->devs[slot].bio = NULL;
+ to_put = bio;
+ dec_rdev = 1;
+ }
+ } else
+ set_bit(R10BIO_WriteError, &r10_bio->state);
}
} else {
/*
@@ -529,6 +547,8 @@ static void raid10_end_write_request(struct bio *bio)
one_write_done(r10_bio);
if (dec_rdev)
rdev_dec_pending(rdev, conf->mddev);
+ if (to_put)
+ bio_put(to_put);
}

/*
@@ -1391,6 +1411,9 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
mbio->bi_bdev = rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
bio_set_op_attrs(mbio, op, do_sync | do_fua);
+ if (test_bit(FailFast, &conf->mirrors[d].rdev->flags) &&
+ enough(conf, d))
+ mbio->bi_opf |= MD_FAILFAST;
mbio->bi_private = r10_bio;

if (conf->mddev->gendisk)
@@ -2051,6 +2074,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
atomic_inc(&r10_bio->remaining);
md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio));

+ if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
+ tbio->bi_opf |= MD_FAILFAST;
tbio->bi_iter.bi_sector += conf->mirrors[d].rdev->data_offset;
tbio->bi_bdev = conf->mirrors[d].rdev->bdev;
generic_make_request(tbio);
@@ -3340,6 +3365,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
bio->bi_private = r10_bio;
bio->bi_end_io = end_sync_write;
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+ if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
+ bio->bi_opf |= MD_FAILFAST;
bio->bi_iter.bi_sector = sector + rdev->data_offset;
bio->bi_bdev = rdev->bdev;
count++;


2016-11-18 05:17:32

by NeilBrown

Subject: [md PATCH 5/6] md/raid10: add failfast handling for reads.

If a device is marked FailFast, and it is not the only
device we can read from, we mark the bio as MD_FAILFAST.

If this does fail-fast, we don't try read repair but just
allow failure.

If it was the last device, it doesn't get marked Faulty so
the retry happens on the same device - this time without
FAILFAST. A subsequent failure will not retry but will just
pass up the error.

During resync we may use FAILFAST requests, and on a failure
we will simply use the other device(s).

During recovery we will only use FAILFAST in the unusual
case where there are multiple places to read from - i.e. if
there are > 2 devices. If we get a failure we will fail the
device and complete the resync/recovery with remaining
devices.

Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/raid10.c | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
drivers/md/raid10.h | 2 ++
2 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 763ca45b6b32..99fa1b980371 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -720,6 +720,7 @@ static struct md_rdev *read_balance(struct r10conf *conf,
best_dist = MaxSector;
best_good_sectors = 0;
do_balance = 1;
+ clear_bit(R10BIO_FailFast, &r10_bio->state);
/*
* Check if we can balance. We can balance on the whole
* device if no resync is going on (recovery is ok), or below
@@ -784,15 +785,18 @@ static struct md_rdev *read_balance(struct r10conf *conf,
if (!do_balance)
break;

+ if (best_slot >= 0)
+ /* At least 2 disks to choose from so failfast is OK */
+ set_bit(R10BIO_FailFast, &r10_bio->state);
/* This optimisation is debatable, and completely destroys
* sequential read speed for 'far copies' arrays. So only
* keep it for 'near' arrays, and review those later.
*/
if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
- break;
+ new_distance = 0;

/* for far > 1 always use the lowest address */
- if (geo->far_copies > 1)
+ else if (geo->far_copies > 1)
new_distance = r10_bio->devs[slot].addr;
else
new_distance = abs(r10_bio->devs[slot].addr -
@@ -1171,6 +1175,9 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
read_bio->bi_bdev = rdev->bdev;
read_bio->bi_end_io = raid10_end_read_request;
bio_set_op_attrs(read_bio, op, do_sync);
+ if (test_bit(FailFast, &rdev->flags) &&
+ test_bit(R10BIO_FailFast, &r10_bio->state))
+ read_bio->bi_opf |= MD_FAILFAST;
read_bio->bi_private = r10_bio;

if (mddev->gendisk)
@@ -1987,6 +1994,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
/* now find blocks with errors */
for (i=0 ; i < conf->copies ; i++) {
int j, d;
+ struct md_rdev *rdev;

tbio = r10_bio->devs[i].bio;

@@ -1994,6 +2002,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
continue;
if (i == first)
continue;
+ d = r10_bio->devs[i].devnum;
+ rdev = conf->mirrors[d].rdev;
if (!r10_bio->devs[i].bio->bi_error) {
/* We know that the bi_io_vec layout is the same for
* both 'first' and 'i', so we just compare them.
@@ -2016,6 +2026,10 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
/* Don't fix anything. */
continue;
+ } else if (test_bit(FailFast, &rdev->flags)) {
+ /* Just give up on this device */
+ md_error(rdev->mddev, rdev);
+ continue;
}
/* Ok, we need to write this bio, either to correct an
* inconsistency or to correct an unreadable block.
@@ -2033,7 +2047,6 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)

bio_copy_data(tbio, fbio);

- d = r10_bio->devs[i].devnum;
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
atomic_inc(&r10_bio->remaining);
md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio));
@@ -2540,12 +2553,14 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
bio_put(bio);
r10_bio->devs[slot].bio = NULL;

- if (mddev->ro == 0) {
+ if (mddev->ro)
+ r10_bio->devs[slot].bio = IO_BLOCKED;
+ else if (!test_bit(FailFast, &rdev->flags)) {
freeze_array(conf, 1);
fix_read_error(conf, mddev, r10_bio);
unfreeze_array(conf);
} else
- r10_bio->devs[slot].bio = IO_BLOCKED;
+ md_error(mddev, rdev);

rdev_dec_pending(rdev, mddev);

@@ -2574,6 +2589,9 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
+ choose_data_offset(r10_bio, rdev);
bio->bi_bdev = rdev->bdev;
bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
+ if (test_bit(FailFast, &rdev->flags) &&
+ test_bit(R10BIO_FailFast, &r10_bio->state))
+ bio->bi_opf |= MD_FAILFAST;
bio->bi_private = r10_bio;
bio->bi_end_io = raid10_end_read_request;
trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
@@ -3095,6 +3113,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
bio->bi_private = r10_bio;
bio->bi_end_io = end_sync_read;
bio_set_op_attrs(bio, REQ_OP_READ, 0);
+ if (test_bit(FailFast, &rdev->flags))
+ bio->bi_opf |= MD_FAILFAST;
from_addr = r10_bio->devs[j].addr;
bio->bi_iter.bi_sector = from_addr +
rdev->data_offset;
@@ -3200,6 +3220,23 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
rdev_dec_pending(mrdev, mddev);
if (mreplace)
rdev_dec_pending(mreplace, mddev);
+ if (r10_bio->devs[0].bio->bi_opf & MD_FAILFAST) {
+ /* Only want this if there is elsewhere to
+ * read from. 'j' is currently the first
+ * readable copy.
+ */
+ int targets = 1;
+ for (; j < conf->copies; j++) {
+ int d = r10_bio->devs[j].devnum;
+ if (conf->mirrors[d].rdev &&
+ test_bit(In_sync,
+ &conf->mirrors[d].rdev->flags))
+ targets++;
+ }
+ if (targets == 1)
+ r10_bio->devs[0].bio->bi_opf
+ &= ~MD_FAILFAST;
+ }
}
if (biolist == NULL) {
while (r10_bio) {
@@ -3278,6 +3315,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
bio->bi_private = r10_bio;
bio->bi_end_io = end_sync_read;
bio_set_op_attrs(bio, REQ_OP_READ, 0);
+ if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
+ bio->bi_opf |= MD_FAILFAST;
bio->bi_iter.bi_sector = sector + rdev->data_offset;
bio->bi_bdev = rdev->bdev;
count++;
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 18ec1f7a98bf..3162615e57bd 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -156,5 +156,7 @@ enum r10bio_state {
* flag is set
*/
R10BIO_Previous,
+/* failfast devices did receive failfast requests. */
+ R10BIO_FailFast,
};
#endif


2016-11-18 07:09:46

by Hannes Reinecke

Subject: Re: [PATCH/RFC] add "failfast" support for raid1/raid10.

(Seeing that it was me who initiated those patches I guess I should
speak up here)

On 11/18/2016 06:16 AM, NeilBrown wrote:
> Hi,
>
> I've been sitting on these patches for a while because although they
> solve a real problem, it is a fairly limited use-case, and I don't
> really like some of the details.
>
> So I'm posting them as RFC in the hope that a different perspective
> might help me like them better, or find a better approach.
>
[ .. ]
> My two main concerns are:
> - does this functionality have any use-case outside of mirrored
> storage arrays, and are there other storage arrays which
> occasionally inserted excessive latency (seems like a serious
> misfeature to me, but I know few of the details)?
Yes, there are.
I've come across some storage arrays which really take some liberty when
doing internal error recovery; some even take up to 20 minutes
before sending a command completion (the response was "there's nothing
in the SCSI spec which forbids us to do so").

> - would it be at all possible to have "real" failfast functionality
> in the block layer? I.e. something that is based on time rather
> than retry count. Maybe in some cases a retry would be
> appropriate if the first failure was very fast.
> I.e. it would reduce timeouts and decide on retries based on
> elapsed time rather than number of attempts.
> With this would come the question of "how fast is fast" and I
> don't have a really good answer. Maybe md would need to set a
> timeout, which it would double whenever it got failures on all
> drives. Otherwise the timeout would drift towards (say) 10 times
> the typical response time.
>
The current 'failfast' is rather a 'do not attempt error recovery' flag;
i.e. the SCSI stack should _not_ start error recovery but rather pass the
request upwards in case of failure.
The problem is that there is no real upper limit on the time error recovery
could take, and it's virtually impossible to give I/O response time
guarantees once error recovery has been invoked.
And to make matters worse, in most cases error recovery won't work
_anyway_ if the transport is severed.
So this is more to do with error recovery, and not so much with the time
each request can/should spend in flight.

The S/390 DASD case is even worse, as the DASD driver _by design_ will
always have to wait for an answer from the storage array. So if the link
to the array is severed you are in deep trouble, as you'll never get a
completion (or any status, for that matter) until the array is reconnected.

So while the FAILFAST flag is a mere convenience for SCSI, it's a
positive must for S/390 if you want to have a functional RAID.

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2016-11-18 15:41:22

by Jack Wang

Subject: Re: [PATCH/RFC] add "failfast" support for raid1/raid10.

2016-11-18 6:16 GMT+01:00 NeilBrown <[email protected]>:
> Hi,
>
> I've been sitting on these patches for a while because although they
> solve a real problem, it is a fairly limited use-case, and I don't
> really like some of the details.
>
> So I'm posting them as RFC in the hope that a different perspective
> might help me like them better, or find a better approach.
>
> The core idea is that when you have multiple copies of data
> (i.e. mirrored drives) it doesn't make sense to wait for a read from
> a drive that seems to be having problems. It will probably be faster
> to just cancel that read, and read from the other device.
> Similarly, in some circumstances, it might be better to fail a drive
> that is being slow to respond to writes, rather than cause all writes
> to be very slow.
>
> The particular context where this comes up is when mirroring across
> storage arrays, where the storage arrays can temporarily take an
> unusually long time to respond to requests (firmware updates have
> been mentioned). As the array will have redundancy internally, there
> is little risk to the data. The mirrored pair is really only for
> disaster recovery, and it is deemed better to lose the last few
> minutes of updates in the case of a serious disaster, rather than
> occasionally having latency issues because one array needs to do some
> maintenance for a few minutes. The particular storage arrays in
> question are DASD devices which are part of the s390 ecosystem.

Hi Neil,

Thanks for pushing this feature also to mainline.
We at Profitbricks use raid1 across an IB network: one pserver with
raid1, both legs on 2 remote storages.
We've noticed that if one remote storage crashes, raid1 still keeps
sending IO to the faulty leg; even after 5 minutes,
md still redirects I/Os, and md refuses to remove active disks, e.g.:

2016-10-27T19:47:07.776233+02:00 pserver25 kernel: [184749.101984]
md/raid1:md23: Disk failure on ibnbd47, disabling device.

2016-10-27T19:47:07.776243+02:00 pserver25 kernel: [184749.101984]
md/raid1:md23: Operation continuing on 1 devices.

[...]

2016-10-27T19:47:16.171694+02:00 pserver25 kernel: [184757.498693]
md/raid1:md23: redirecting sector 79104 to other mirror: ibnbd46

[...]

2016-10-27T19:47:21.301732+02:00 pserver25 kernel: [184762.627288]
md/raid1:md23: redirecting sector 79232 to other mirror: ibnbd46

[...]

2016-10-27T19:47:35.501725+02:00 pserver25 kernel: [184776.829069] md:
cannot remove active disk ibnbd47 from md23 ...

2016-10-27T19:47:36.801769+02:00 pserver25 kernel: [184778.128856] md:
cannot remove active disk ibnbd47 from md23 ...

[...]

2016-10-27T19:52:33.401816+02:00 pserver25 kernel: [185074.727859]
md/raid1:md23: redirecting sector 72832 to other mirror: ibnbd46

2016-10-27T19:52:36.601693+02:00 pserver25 kernel: [185077.924835]
md/raid1:md23: redirecting sector 78336 to other mirror: ibnbd46

2016-10-27T19:52:36.601728+02:00 pserver25 kernel: [185077.925083]
RAID1 conf printout:

2016-10-27T19:52:36.601731+02:00 pserver25 kernel: [185077.925087]
--- wd:1 rd:2

2016-10-27T19:52:36.601733+02:00 pserver25 kernel: [185077.925091]
disk 0, wo:0, o:1, dev:ibnbd46

2016-10-27T19:52:36.601735+02:00 pserver25 kernel: [185077.925093]
disk 1, wo:1, o:0, dev:ibnbd47

2016-10-27T19:52:36.681691+02:00 pserver25 kernel: [185078.003392]
RAID1 conf printout:

2016-10-27T19:52:36.681706+02:00 pserver25 kernel: [185078.003404]
--- wd:1 rd:2

2016-10-27T19:52:36.681709+02:00 pserver25 kernel: [185078.003409]
disk 0, wo:0, o:1, dev:ibnbd46

I tried to port your patch from SLES[1]; with the patchset, it reduces
the time to ~30 seconds.

I'm happy to see this feature upstream :)
I will test again this new patchset.

Cheers,
Jack Wang

2016-11-22 03:53:20

by Shaohua Li

Subject: Re: [PATCH/RFC] add "failfast" support for raid1/raid10.

On Fri, Nov 18, 2016 at 04:16:11PM +1100, Neil Brown wrote:
> Hi,
>
> I've been sitting on these patches for a while because although they
> solve a real problem, it is a fairly limited use-case, and I don't
> really like some of the details.
>
> So I'm posting them as RFC in the hope that a different perspective
> might help me like them better, or find a better approach.
>
> The core idea is that when you have multiple copies of data
> (i.e. mirrored drives) it doesn't make sense to wait for a read from
> a drive that seems to be having problems. It will probably be faster
> to just cancel that read, and read from the other device.
> Similarly, in some circumstances, it might be better to fail a drive
> that is being slow to respond to writes, rather than cause all writes
> to be very slow.
>
> The particular context where this comes up is when mirroring across
> storage arrays, where the storage arrays can temporarily take an
> unusually long time to respond to requests (firmware updates have
> been mentioned). As the array will have redundancy internally, there
> is little risk to the data. The mirrored pair is really only for
> disaster recovery, and it is deemed better to lose the last few
> minutes of updates in the case of a serious disaster, rather than
> occasionally having latency issues because one array needs to do some
> maintenance for a few minutes. The particular storage arrays in
> question are DASD devices which are part of the s390 ecosystem.
>
> Linux block layer has "failfast" flags to direct drivers to fail more
> quickly. These patches allow devices in an md array to be given a
> "failfast" flag, which will cause IO requests to be marked as
> "failfast" providing there is another device available. Once the
> array becomes degraded, we stop using failfast, as that could result
> in data loss.
>
> I don't like the whole "failfast" concept because it is not at all
> clear how fast "fast" is. In fact, these block-layer flags are
> really a misnomer. They should be "noretry" flags.
> REQ_FAILFAST_DEV means "don't retry requests which reported an error
> which seems to come from the device.
> REQ_FAILFAST_TRANSPORT means "don't retry requests which seem to
> indicate a problem with the transport, rather than the device"
> REQ_FAILFAST_DRIVER means .... I'm not exactly sure. I think it
> means whatever a particular driver wants it to mean, basically "I
> cannot seem to handle this right now, just resend and I'll probably
> be more in control next time". It seems to be for internal-use only.
>
> Multipath code uses REQ_FAILFAST_TRANSPORT only, which makes sense.
> btrfs uses REQ_FAILFAST_DEV only (for read-ahead) which doesn't seem
> to make sense.... why would you ever use _DEV without _TRANSPORT?
>
> None of these actually change the timeouts in the driver or in the
> device, which is what I would expect for "failfast", so to get real
> "fast failure" you need to enable failfast, and adjust the timeouts.
> That is what we do for our customers with DASD.
>
> Anyway, it seems to make sense to use _TRANSPORT and _DEV for
> requests from md where there is somewhere to fall-back on.
> If we get an error from a "failfast" request, and the array is still
> non-degraded, we just fail the device. We don't try to repair read
> errors (which is pointless on storage arrays).
>
> It is assumed that some user-space code will notice the failure,
> monitor the device to see when it becomes available again, and then
> --re-add it. Assuming the array has a bitmap, the --re-add should be
> fast and the array will become optimal again without experiencing
> excessive latencies.
>
> My two main concerns are:
> - does this functionality have any use-case outside of mirrored
> storage arrays, and are there other storage arrays which
> occasionally inserted excessive latency (seems like a serious
> misfeature to me, but I know few of the details)?
> - would it be at all possible to have "real" failfast functionality
> in the block layer? I.e. something that is based on time rather
> than retry count. Maybe in some cases a retry would be
> appropriate if the first failure was very fast.
> I.e. it would reduce timeouts and decide on retries based on
> elapsed time rather than number of attempts.
> With this would come the question of "how fast is fast" and I
> don't have a really good answer. Maybe md would need to set a
> timeout, which it would double whenever it got failures on all
> drives. Otherwise the timeout would drift towards (say) 10 times
> the typical response time.
>
> So: comments most welcome. As I say, this does address a genuine
> need. Just find it hard to like it :-(

Patches look good. As long as these don't break a normal raid array (which they
don't if the superblock bit isn't set), I have no objection to applying the
patches even though they are for special usage. I'll add the series to the next tree.

Just curious: will FAILFAST increase the chance of IO failure?

Thanks,
Shaohua

2016-11-24 04:47:31

by NeilBrown

Subject: Re: [PATCH/RFC] add "failfast" support for raid1/raid10.

On Sat, Nov 19 2016, Jack Wang wrote:

> 2016-11-18 6:16 GMT+01:00 NeilBrown <[email protected]>:
>> Hi,
>>
>> I've been sitting on these patches for a while because although they
>> solve a real problem, it is a fairly limited use-case, and I don't
>> really like some of the details.
>>
>> So I'm posting them as RFC in the hope that a different perspective
>> might help me like them better, or find a better approach.
>>
>> The core idea is that when you have multiple copies of data
>> (i.e. mirrored drives) it doesn't make sense to wait for a read from
>> a drive that seems to be having problems. It will probably be faster
>> to just cancel that read, and read from the other device.
>> Similarly, in some circumstances, it might be better to fail a drive
>> that is being slow to respond to writes, rather than cause all writes
>> to be very slow.
>>
>> The particular context where this comes up is when mirroring across
>> storage arrays, where the storage arrays can temporarily take an
>> unusually long time to respond to requests (firmware updates have
>> been mentioned). As the array will have redundancy internally, there
>> is little risk to the data. The mirrored pair is really only for
>> disaster recovery, and it is deemed better to lose the last few
>> minutes of updates in the case of a serious disaster, rather than
>> occasionally having latency issues because one array needs to do some
>> maintenance for a few minutes. The particular storage arrays in
>> question are DASD devices which are part of the s390 ecosystem.
>
> Hi Neil,
>
> Thanks for pushing this feature also to mainline.
> We at Profitbricks use raid1 across IB network, one pserver with
> raid1, both legs on 2 remote storages.
> We've noticed if one remote storage crash , and raid1 still keep
> sending IO to the faulty leg, even after 5 minutes,
> md still redirect I/Os, and md refuse to remove active disks, eg:

That makes sense. It cannot remove the active disk until all pending IO
completes, either with an error or with success.

If the target has a long timeout, that can delay progress a lot.

>
> I tried to port you patch from SLES[1], with the patchset, it reduce
> the time to ~30 seconds.
>
> I'm happy to see this feature upstream :)
> I will test again this new patchset.

Thanks for your confirmation that this is more generally useful than I
thought, and I'm always happy to hear of more testing :-)

Thanks,
NeilBrown



2016-11-24 16:06:47

by Jack Wang

Subject: Re: [PATCH/RFC] add "failfast" support for raid1/raid10.

Hi Neil,

2016-11-24 5:47 GMT+01:00 NeilBrown <[email protected]>:
> On Sat, Nov 19 2016, Jack Wang wrote:
>
>> 2016-11-18 6:16 GMT+01:00 NeilBrown <[email protected]>:
>>> Hi,
>>>
>>> I've been sitting on these patches for a while because although they
>>> solve a real problem, it is a fairly limited use-case, and I don't
>>> really like some of the details.
>>>
>>> So I'm posting them as RFC in the hope that a different perspective
>>> might help me like them better, or find a better approach.
>>>
>>> The core idea is that when you have multiple copies of data
>>> (i.e. mirrored drives) it doesn't make sense to wait for a read from
>>> a drive that seems to be having problems. It will probably be faster
>>> to just cancel that read, and read from the other device.
>>> Similarly, in some circumstances, it might be better to fail a drive
>>> that is being slow to respond to writes, rather than cause all writes
>>> to be very slow.
>>>
>>> The particular context where this comes up is when mirroring across
>>> storage arrays, where the storage arrays can temporarily take an
>>> unusually long time to respond to requests (firmware updates have
>>> been mentioned). As the array will have redundancy internally, there
>>> is little risk to the data. The mirrored pair is really only for
>>> disaster recovery, and it is deemed better to lose the last few
>>> minutes of updates in the case of a serious disaster, rather than
>>> occasionally having latency issues because one array needs to do some
>>> maintenance for a few minutes. The particular storage arrays in
>>> question are DASD devices which are part of the s390 ecosystem.
>>
>> Hi Neil,
>>
>> Thanks for pushing this feature also to mainline.
>> We at Profitbricks use raid1 across an IB network: one pserver runs
>> the raid1, with both legs on two remote storages.
>> We've noticed that if one remote storage crashes, raid1 still keeps
>> sending IO to the faulty leg; even after 5 minutes,
>> md still redirects I/Os and refuses to remove the active disk, e.g.:
>
> That makes sense. It cannot remove the active disk until all pending IO
> completes, either with an error or with success.
>
> If the target has a long timeout, that can delay progress a lot.
>
>>
>> I tried to port your patch from SLES[1]; with the patchset, it reduces
>> the time to ~30 seconds.
>>
>> I'm happy to see this feature upstream :)
>> I will test again this new patchset.
>
> Thanks for your confirmation that this is more generally useful than I
> thought, and I'm always happy to hear of more testing :-)
>
> Thanks,
> NeilBrown

Just want to update the test result: so far it's working fine, no regression :)
Will report if anything breaks.

Thanks
Jack

2016-11-24 23:56:13

by NeilBrown

[permalink] [raw]
Subject: [mdadm PATCH] Add failfast support.


Allow per-device "failfast" flag to be set when creating an
array or adding devices to an array.

When re-adding a device which had the failfast flag, the flag can be
removed using --nofailfast.

failfast status is printed in --detail and --examine output.

Signed-off-by: NeilBrown <[email protected]>
---

Hi Jes,
this patch adds mdadm support for the failfast functionality that
Shaohua recently included in his for-next.
Hopefully the man-page additions provide all necessary context.
If there is anything that seems to be missing, I'll be very happy to
add it.
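
For quick reference, a usage sketch with this patch applied (device
and array names are placeholders; the internal bitmap is there so a
later --re-add is quick):

  # create a RAID1 with both legs flagged failfast
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --failfast /dev/sda1 /dev/sdb1

  # re-add a device that was kicked out, this time clearing the flag
  mdadm /dev/md0 --nofailfast --re-add /dev/sdb1

  # the flag is reported per device
  mdadm --detail /dev/md0
  mdadm --examine /dev/sdb1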

Thanks,
NeilBrown


Create.c | 2 ++
Detail.c | 1 +
Incremental.c | 1 +
Manage.c | 20 +++++++++++++++++++-
ReadMe.c | 2 ++
md.4 | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
md_p.h | 1 +
mdadm.8.in | 32 +++++++++++++++++++++++++++++++-
mdadm.c | 11 +++++++++++
mdadm.h | 5 +++++
super0.c | 12 ++++++++----
super1.c | 13 +++++++++++++
12 files changed, 148 insertions(+), 6 deletions(-)
mode change 100755 => 100644 mdadm.h

diff --git a/Create.c b/Create.c
index 1594a3919139..bd114eabafc1 100644
--- a/Create.c
+++ b/Create.c
@@ -890,6 +890,8 @@ int Create(struct supertype *st, char *mddev,

if (dv->writemostly == 1)
inf->disk.state |= (1<<MD_DISK_WRITEMOSTLY);
+ if (dv->failfast == 1)
+ inf->disk.state |= (1<<MD_DISK_FAILFAST);

if (have_container)
fd = -1;
diff --git a/Detail.c b/Detail.c
index 925e4794c983..509b0d418768 100644
--- a/Detail.c
+++ b/Detail.c
@@ -658,6 +658,7 @@ This is pretty boring
}
if (disk.state & (1<<MD_DISK_REMOVED)) printf(" removed");
if (disk.state & (1<<MD_DISK_WRITEMOSTLY)) printf(" writemostly");
+ if (disk.state & (1<<MD_DISK_FAILFAST)) printf(" failfast");
if (disk.state & (1<<MD_DISK_JOURNAL)) printf(" journal");
if ((disk.state &
((1<<MD_DISK_ACTIVE)|(1<<MD_DISK_SYNC)
diff --git a/Incremental.c b/Incremental.c
index cc01d41e641a..75d95ccc497a 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -1035,6 +1035,7 @@ static int array_try_spare(char *devname, int *dfdp, struct dev_policy *pol,
devlist.next = NULL;
devlist.used = 0;
devlist.writemostly = 0;
+ devlist.failfast = 0;
devlist.devname = chosen_devname;
sprintf(chosen_devname, "%d:%d", major(stb.st_rdev),
minor(stb.st_rdev));
diff --git a/Manage.c b/Manage.c
index 1b7b0c111c83..429d8631cd23 100644
--- a/Manage.c
+++ b/Manage.c
@@ -683,8 +683,13 @@ int attempt_re_add(int fd, int tfd, struct mddev_dev *dv,
disc.state |= 1 << MD_DISK_WRITEMOSTLY;
if (dv->writemostly == 2)
disc.state &= ~(1 << MD_DISK_WRITEMOSTLY);
+ if (dv->failfast == 1)
+ disc.state |= 1 << MD_DISK_FAILFAST;
+ if (dv->failfast == 2)
+ disc.state &= ~(1 << MD_DISK_FAILFAST);
remove_partitions(tfd);
- if (update || dv->writemostly > 0) {
+ if (update || dv->writemostly > 0
+ || dv->failfast > 0) {
int rv = -1;
tfd = dev_open(dv->devname, O_RDWR);
if (tfd < 0) {
@@ -700,6 +705,14 @@ int attempt_re_add(int fd, int tfd, struct mddev_dev *dv,
rv = dev_st->ss->update_super(
dev_st, NULL, "readwrite",
devname, verbose, 0, NULL);
+ if (dv->failfast == 1)
+ rv = dev_st->ss->update_super(
+ dev_st, NULL, "failfast",
+ devname, verbose, 0, NULL);
+ if (dv->failfast == 2)
+ rv = dev_st->ss->update_super(
+ dev_st, NULL, "nofailfast",
+ devname, verbose, 0, NULL);
if (update)
rv = dev_st->ss->update_super(
dev_st, NULL, update,
@@ -964,6 +977,8 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
disc.state |= (1 << MD_DISK_JOURNAL) | (1 << MD_DISK_SYNC);
if (dv->writemostly == 1)
disc.state |= 1 << MD_DISK_WRITEMOSTLY;
+ if (dv->failfast == 1)
+ disc.state |= 1 << MD_DISK_FAILFAST;
dfd = dev_open(dv->devname, O_RDWR | O_EXCL|O_DIRECT);
if (tst->ss->add_to_super(tst, &disc, dfd,
dv->devname, INVALID_SECTORS))
@@ -1009,6 +1024,8 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,

if (dv->writemostly == 1)
disc.state |= (1 << MD_DISK_WRITEMOSTLY);
+ if (dv->failfast == 1)
+ disc.state |= (1 << MD_DISK_FAILFAST);
if (tst->ss->external) {
/* add a disk
* to an external metadata container */
@@ -1785,6 +1802,7 @@ int move_spare(char *from_devname, char *to_devname, dev_t devid)
devlist.next = NULL;
devlist.used = 0;
devlist.writemostly = 0;
+ devlist.failfast = 0;
devlist.devname = devname;
sprintf(devname, "%d:%d", major(devid), minor(devid));

diff --git a/ReadMe.c b/ReadMe.c
index d3fcb6132fe9..8da49ef46dfb 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -136,6 +136,8 @@ struct option long_options[] = {
{"bitmap-chunk", 1, 0, BitmapChunk},
{"write-behind", 2, 0, WriteBehind},
{"write-mostly",0, 0, WriteMostly},
+ {"failfast", 0, 0, FailFast},
+ {"nofailfast",0, 0, NoFailFast},
{"re-add", 0, 0, ReAdd},
{"homehost", 1, 0, HomeHost},
{"symlinks", 1, 0, Symlinks},
diff --git a/md.4 b/md.4
index f1b88ee6bb03..5bdf7a7bd375 100644
--- a/md.4
+++ b/md.4
@@ -916,6 +916,60 @@ slow). The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.

+.SS FAILFAST
+
+From Linux 4.10,
+.I
+md
+supports FAILFAST for RAID1 and RAID10 arrays. This is a flag that
+can be set on individual drives, though it is usually set on all
+drives, or no drives.
+
+When
+.I md
+sends an I/O request to a drive that is marked as FAILFAST, and when
+the array could survive the loss of that drive without losing data,
+.I md
+will request that the underlying device does not perform any retries.
+This means that a failure will be reported to
+.I md
+promptly, and it can mark the device as faulty and continue using the
+other device(s).
+.I md
+cannot control the timeout that the underlying devices use to
+determine failure. Any changes desired to that timeout must be set
+explicitly on the underlying device, separately from using
+.IR mdadm .
+
+If a FAILFAST request does fail, and if it is still safe to mark the
+device as faulty without data loss, that will be done and the array
+will continue functioning on a reduced number of devices. If it is not
+possible to safely mark the device as faulty,
+.I md
+will retry the request without disabling retries in the underlying
+device. In any case,
+.I md
+will not attempt to repair read errors on a device marked as FAILFAST
+by writing out the correct data. It will just mark the device as faulty.
+
+FAILFAST is appropriate for storage arrays that have a low probability
+of true failure, but will sometimes introduce unacceptable delays to
+I/O requests while performing internal maintenance. The value of
+setting FAILFAST involves a trade-off. The gain is that the chance of
+unacceptable delays is substantially reduced. The cost is that the
+unlikely event of data-loss on one device is slightly more likely to
+result in data-loss for the array.
+
+When a device in an array using FAILFAST is marked as faulty, it will
+usually become usable again in a short while.
+.I mdadm
+makes no attempt to detect that possibility. Some separate
+mechanism, tuned to the specific details of the expected failure modes,
+needs to be created to monitor devices to see when they return to full
+functionality, and to then re-add them to the array. In order of
+this "re-add" functionality to be effective, an array using FAILFAST
+should always have a write-intent bitmap.
+
.SS RESTRIPING

.IR Restriping ,
diff --git a/md_p.h b/md_p.h
index 0d691fbc987d..dc9fec165cb6 100644
--- a/md_p.h
+++ b/md_p.h
@@ -89,6 +89,7 @@
* read requests will only be sent here in
* dire need
*/
+#define MD_DISK_FAILFAST 10 /* Fewer retries, more failures */

#define MD_DISK_REPLACEMENT 17
#define MD_DISK_JOURNAL 18 /* disk is used as the write journal in RAID-5/6 */
diff --git a/mdadm.8.in b/mdadm.8.in
index 3c0c58f95f35..aa80f0c1a631 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -747,7 +747,7 @@ subsequent devices listed in a
.BR \-\-create ,
or
.B \-\-add
-command will be flagged as 'write-mostly'. This is valid for RAID1
+command will be flagged as 'write\-mostly'. This is valid for RAID1
only and means that the 'md' driver will avoid reading from these
devices if at all possible. This can be useful if mirroring over a
slow link.
@@ -762,6 +762,25 @@ mode, and write-behind is only attempted on drives marked as
.IR write-mostly .

.TP
+.BR \-\-failfast
+subsequent devices listed in a
+.B \-\-create
+or
+.B \-\-add
+command will be flagged as 'failfast'. This is valid for RAID1 and
+RAID10 only. IO requests to these devices will be encouraged to fail
+quickly rather than cause long delays due to error handling. Also no
+attempt is made to repair a read error on these devices.
+
+If an array becomes degraded so that the 'failfast' device is the only
+usable device, the 'failfast' flag will then be ignored and extended
+delays will be preferred to complete failure.
+
+The 'failfast' flag is appropriate for storage arrays which have a
+low probability of true failure, but which may sometimes
+cause unacceptable delays due to internal maintenance functions.
+
+.TP
.BR \-\-assume\-clean
Tell
.I mdadm
@@ -1452,6 +1471,17 @@ that had a failed journal. To avoid interrupting on-going write opertions,
.B \-\-add-journal
only works for array in Read-Only state.

+.TP
+.BR \-\-failfast
+Subsequent devices that are added or re\-added will have
+the 'failfast' flag set. This is only valid for RAID1 and RAID10 and
+means that the 'md' driver will avoid long timeouts on error handling
+where possible.
+.TP
+.BR \-\-nofailfast
+Subsequent devices that are re\-added will be re\-added without
+the 'failfast' flag set.
+
.P
Each of these options requires that the first device listed is the array
to be acted upon, and the remainder are component devices to be added,
diff --git a/mdadm.c b/mdadm.c
index cca093318d8d..3c8f273c8254 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -90,6 +90,7 @@ int main(int argc, char *argv[])
int spare_sharing = 1;
struct supertype *ss = NULL;
int writemostly = 0;
+ int failfast = 0;
char *shortopt = short_options;
int dosyslog = 0;
int rebuild_map = 0;
@@ -295,6 +296,7 @@ int main(int argc, char *argv[])
dv->devname = optarg;
dv->disposition = devmode;
dv->writemostly = writemostly;
+ dv->failfast = failfast;
dv->used = 0;
dv->next = NULL;
*devlistend = dv;
@@ -351,6 +353,7 @@ int main(int argc, char *argv[])
dv->devname = optarg;
dv->disposition = devmode;
dv->writemostly = writemostly;
+ dv->failfast = failfast;
dv->used = 0;
dv->next = NULL;
*devlistend = dv;
@@ -417,6 +420,14 @@ int main(int argc, char *argv[])
writemostly = 2;
continue;

+ case O(MANAGE,FailFast):
+ case O(CREATE,FailFast):
+ failfast = 1;
+ continue;
+ case O(MANAGE,NoFailFast):
+ failfast = 2;
+ continue;
+
case O(GROW,'z'):
case O(CREATE,'z'):
case O(BUILD,'z'): /* size */
diff --git a/mdadm.h b/mdadm.h
old mode 100755
new mode 100644
index 240ab7f831bc..d47de01f725b
--- a/mdadm.h
+++ b/mdadm.h
@@ -383,6 +383,8 @@ enum special_options {
ConfigFile,
ChunkSize,
WriteMostly,
+ FailFast,
+ NoFailFast,
Layout,
Auto,
Force,
@@ -516,6 +518,7 @@ struct mddev_dev {
* Not set for names read from .config
*/
char writemostly; /* 1 for 'set writemostly', 2 for 'clear writemostly' */
+ char failfast; /* Ditto but for 'failfast' flag */
int used; /* set when used */
long long data_offset;
struct mddev_dev *next;
@@ -821,6 +824,8 @@ extern struct superswitch {
* linear-grow-update - now change the size of the array.
* writemostly - set the WriteMostly1 bit in the superblock devflags
* readwrite - clear the WriteMostly1 bit in the superblock devflags
+ * failfast - set the FailFast1 bit in the superblock
+ * nofailfast - clear the FailFast1 bit
* no-bitmap - clear any record that a bitmap is present.
* bbl - add a bad-block-log if possible
* no-bbl - remove any bad-block-log is it is empty.
diff --git a/super0.c b/super0.c
index 55ebd8bc7877..938cfd95fa25 100644
--- a/super0.c
+++ b/super0.c
@@ -232,14 +232,15 @@ static void examine_super0(struct supertype *st, char *homehost)
mdp_disk_t *dp;
char *dv;
char nb[5];
- int wonly;
+ int wonly, failfast;
if (d>=0) dp = &sb->disks[d];
else dp = &sb->this_disk;
snprintf(nb, sizeof(nb), "%4d", d);
printf("%4s %5d %5d %5d %5d ", d < 0 ? "this" : nb,
dp->number, dp->major, dp->minor, dp->raid_disk);
wonly = dp->state & (1 << MD_DISK_WRITEMOSTLY);
- dp->state &= ~(1 << MD_DISK_WRITEMOSTLY);
+ failfast = dp->state & (1<<MD_DISK_FAILFAST);
+ dp->state &= ~(wonly | failfast);
if (dp->state & (1 << MD_DISK_FAULTY))
printf(" faulty");
if (dp->state & (1 << MD_DISK_ACTIVE))
@@ -250,6 +251,8 @@ static void examine_super0(struct supertype *st, char *homehost)
printf(" removed");
if (wonly)
printf(" write-mostly");
+ if (failfast)
+ printf(" failfast");
if (dp->state == 0)
printf(" spare");
if ((dv = map_dev(dp->major, dp->minor, 0)))
@@ -581,7 +584,8 @@ static int update_super0(struct supertype *st, struct mdinfo *info,
} else if (strcmp(update, "assemble")==0) {
int d = info->disk.number;
int wonly = sb->disks[d].state & (1<<MD_DISK_WRITEMOSTLY);
- int mask = (1<<MD_DISK_WRITEMOSTLY);
+ int failfast = sb->disks[d].state & (1<<MD_DISK_FAILFAST);
+ int mask = (1<<MD_DISK_WRITEMOSTLY)|(1<<MD_DISK_FAILFAST);
int add = 0;
if (sb->minor_version >= 91)
/* During reshape we don't insist on everything
@@ -590,7 +594,7 @@ static int update_super0(struct supertype *st, struct mdinfo *info,
add = (1<<MD_DISK_SYNC);
if (((sb->disks[d].state & ~mask) | add)
!= (unsigned)info->disk.state) {
- sb->disks[d].state = info->disk.state | wonly;
+ sb->disks[d].state = info->disk.state | wonly |failfast;
rv = 1;
}
if (info->reshape_active &&
diff --git a/super1.c b/super1.c
index d3234392d453..87a74cb94508 100644
--- a/super1.c
+++ b/super1.c
@@ -77,6 +77,7 @@ struct mdp_superblock_1 {
__u8 device_uuid[16]; /* user-space setable, ignored by kernel */
__u8 devflags; /* per-device flags. Only one defined...*/
#define WriteMostly1 1 /* mask for writemostly flag in above */
+#define FailFast1 2 /* Device should get FailFast requests */
/* bad block log. If there are any bad blocks the feature flag is set.
* if offset and size are non-zero, that space is reserved and available.
*/
@@ -430,6 +431,8 @@ static void examine_super1(struct supertype *st, char *homehost)
printf(" Flags :");
if (sb->devflags & WriteMostly1)
printf(" write-mostly");
+ if (sb->devflags & FailFast1)
+ printf(" failfast");
printf("\n");
}

@@ -1020,6 +1023,8 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
}
if (sb->devflags & WriteMostly1)
info->disk.state |= (1 << MD_DISK_WRITEMOSTLY);
+ if (sb->devflags & FailFast1)
+ info->disk.state |= (1 << MD_DISK_FAILFAST);
info->events = __le64_to_cpu(sb->events);
sprintf(info->text_version, "1.%d", st->minor_version);
info->safe_mode_delay = 200;
@@ -1377,6 +1382,10 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
sb->devflags |= WriteMostly1;
else if (strcmp(update, "readwrite")==0)
sb->devflags &= ~WriteMostly1;
+ else if (strcmp(update, "failfast") == 0)
+ sb->devflags |= FailFast1;
+ else if (strcmp(update, "nofailfast") == 0)
+ sb->devflags &= ~FailFast1;
else
rv = -1;

@@ -1713,6 +1722,10 @@ static int write_init_super1(struct supertype *st)
sb->devflags |= WriteMostly1;
else
sb->devflags &= ~WriteMostly1;
+ if (di->disk.state & (1<<MD_DISK_FAILFAST))
+ sb->devflags |= FailFast1;
+ else
+ sb->devflags &= ~FailFast1;

random_uuid(sb->device_uuid);

--
2.10.2


Attachments:
signature.asc (800.00 B)

2016-11-28 13:53:18

by Jes Sorensen

[permalink] [raw]
Subject: Re: [mdadm PATCH] Add failfast support.

NeilBrown <[email protected]> writes:
> Allow per-device "failfast" flag to be set when creating an
> array or adding devices to an array.
>
> When re-adding a device which had the failfast flag, it can be removed
> using --nofailfast.
>
> failfast status is printed in --detail and --examine output.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
>
> Hi Jes,
> this patch adds mdadm support for the failfast functionality that
> Shaohua recently included in his for-next.
> Hopefully the man-page additions provide all necessary context.
> If there is anything that seems to be missing, I'll be very happy to
> add it.

Hi Neil,

It looks reasonable. The only minor concern I have is over the use of
hardcoded magic numbers like 'dv->failfast = 2' rather than using an
enum or defines with descriptive names. Something that can be addressed
later, so patch applied.

Cheers,
Jes