2018-11-28 03:54:02

by Allison Henderson

Subject: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

Motivation:
When fs data/metadata checksums mismatch, lower block devices may have other
correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
decides that the metadata is garbage, today it will shut down the entire
filesystem without trying any of the other mirrors. This is a severe
loss of service, and we propose these patches to have XFS try harder to
avoid failure.

This patch set prototypes the mirror retry idea by:
* Adding @nr_mirrors to struct request_queue, similar to blk_queue_nonrot();
a filesystem can grab the device request queue and check the maximum number
of mirrors the block device has.
Helper functions were also added to get/set nr_mirrors.

* Expanding bi_write_hint to bi_rw_hint; @bi_rw_hint now has three meanings:
1. The original write_hint.
2. end_io() updates @bi_rw_hint to reflect which mirror the i/o actually happened on.
3. The fs sets @bi_rw_hint to force the driver, e.g. raid1, to read from a specific mirror.

* Modify md/raid1 to support this retry feature.

* Add b_rw_hint to xfs_buf
This patch adds a new field b_rw_hint to xfs_buf. We will use this to set the
new bio->bi_rw_hint when submitting the read request, and also to store the
returned mirror when the read completes.

* Add device retry
This patch adds some logic to xfs_buf_read_map. If the read verify
fails, we loop over the available mirrors and retry the read.

* Rewrite retried read
When the read verification fails but the retry succeeds,
write the buffer back to correct the bad mirror.

* Add tracepoints and logging to alternate device retry.
This patch adds new log entries and trace points to the alternate device retry
error path.

We're not planning to take over all 16 bits of the read hint field; just looking for
feedback about the sanity of the overall approach.
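
To make the intended flow concrete, here is the rough shape of the fs-side
retry loop (condensed from patch 5 of this series; hint 0 lets the block layer
pick a mirror, hints 1..n force a specific one):

	q = bdev_get_queue(bp->b_target->bt_bdev);

	for (i = 0; i <= blk_queue_get_mirrors(q); i++) {
		bp->b_error = 0;
		bp->b_rw_hint = i;
		_xfs_buf_read(bp, flags);

		/* -EIO/-EFSCORRUPTED/-EFSBADCRC: try the next mirror */
		if (bp->b_error != -EIO && bp->b_error != -EFSCORRUPTED &&
		    bp->b_error != -EFSBADCRC)
			break;	/* success, or an error a retry won't fix */
	}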

Allison Henderson (4):
xfs: Add b_rw_hint to xfs_buf
xfs: Add device retry
xfs: Rewrite retried read
xfs: Add tracepoints and logging to alternate device retry

Bob Liu (3):
block: add nr_mirrors to request_queue
block: expand write_hint of bio/request to rw_hint
md: raid1: handle bi_rw_hint accordingly

Documentation/block/biodoc.txt | 7 ++++++
block/bio.c | 2 +-
block/blk-core.c | 13 ++++++++++-
block/blk-merge.c | 8 +++----
block/blk-settings.c | 18 ++++++++++++++
block/bounce.c | 2 +-
drivers/md/raid1.c | 33 ++++++++++++++++++++++----
drivers/md/raid5.c | 10 ++++----
drivers/md/raid5.h | 2 +-
drivers/nvme/host/core.c | 2 +-
fs/block_dev.c | 6 +++--
fs/btrfs/extent_io.c | 3 ++-
fs/buffer.c | 3 ++-
fs/direct-io.c | 3 ++-
fs/ext4/page-io.c | 7 ++++--
fs/f2fs/data.c | 2 +-
fs/iomap.c | 3 ++-
fs/mpage.c | 2 +-
fs/xfs/xfs_aops.c | 4 ++--
fs/xfs/xfs_buf.c | 53 ++++++++++++++++++++++++++++++++++++++++--
fs/xfs/xfs_buf.h | 8 +++++++
fs/xfs/xfs_trace.h | 6 ++++-
include/linux/blk_types.h | 2 +-
include/linux/blkdev.h | 5 +++-
24 files changed, 169 insertions(+), 35 deletions(-)

--
2.7.4



2018-11-28 03:53:33

by Allison Henderson

Subject: [PATCH v1 7/7] xfs: Add tracepoints and logging to alternate device retry

This patch adds new log entries and trace points to the
alternate device retry error path.

Signed-off-by: Allison Henderson <[email protected]>
---
fs/xfs/xfs_buf.c | 14 +++++++++++++-
fs/xfs/xfs_buf.h | 1 +
fs/xfs/xfs_trace.h | 6 +++++-
3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 81f6491..f203ddebe 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -833,6 +833,10 @@ xfs_buf_read_map(
for (i = 0; i <= blk_queue_get_mirrors(q); i++) {
bp->b_error = 0;
bp->b_rw_hint = i;
+
+ if (i > 0)
+ xfs_alert(bp->b_target->bt_mount,
+ "Retrying read from disk %hu",i);
_xfs_buf_read(bp, flags);

switch (bp->b_error) {
@@ -840,6 +844,11 @@ xfs_buf_read_map(
case -EFSCORRUPTED:
case -EFSBADCRC:
/* loop again */
+ trace_xfs_buf_ioretry(bp, _RET_IP_);
+ xfs_alert(bp->b_target->bt_mount,
+ "Read error:%d from disk number %hu",
+ bp->b_error, bp->b_rw_hint);
+
continue;
default:
goto retry_done;
@@ -852,8 +861,11 @@ xfs_buf_read_map(
* if we had to try more than one mirror to successfully read
* the buffer, write the buffer back
*/
- if (!bp->b_error && i > 0)
+ if (!bp->b_error && i > 0) {
+ xfs_alert(bp->b_target->bt_mount,
+ "Re-writeing verified data");
xfs_bwrite(bp);
+ }

return bp;
}
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index db138e5..23c9c3e 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -300,6 +300,7 @@ extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
xfs_failaddr_t failaddr);
#define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
extern void xfs_buf_ioerror_alert(struct xfs_buf *, const char *func);
+extern void xfs_buf_ioretry_alert(struct xfs_buf *, const char *func);

extern int __xfs_buf_submit(struct xfs_buf *bp, bool);
static inline int xfs_buf_submit(struct xfs_buf *bp)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 3043e5e..1d98a3e 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -276,6 +276,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
__field(int, pincount)
__field(unsigned, lockval)
__field(unsigned, flags)
+ __field(unsigned short, rw_hint)
__field(unsigned long, caller_ip)
),
TP_fast_assign(
@@ -286,10 +287,11 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
__entry->pincount = atomic_read(&bp->b_pin_count);
__entry->lockval = bp->b_sema.count;
__entry->flags = bp->b_flags;
+ __entry->rw_hint = bp->b_rw_hint;
__entry->caller_ip = caller_ip;
),
TP_printk("dev %d:%d bno 0x%llx nblks 0x%x hold %d pincount %d "
- "lock %d flags %s caller %pS",
+ "lock %d flags %s rw_hint %hu caller %pS",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long long)__entry->bno,
__entry->nblks,
@@ -297,6 +299,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
__entry->pincount,
__entry->lockval,
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
+ __entry->rw_hint,
(void *)__entry->caller_ip)
)

@@ -309,6 +312,7 @@ DEFINE_BUF_EVENT(xfs_buf_free);
DEFINE_BUF_EVENT(xfs_buf_hold);
DEFINE_BUF_EVENT(xfs_buf_rele);
DEFINE_BUF_EVENT(xfs_buf_iodone);
+DEFINE_BUF_EVENT(xfs_buf_ioretry);
DEFINE_BUF_EVENT(xfs_buf_submit);
DEFINE_BUF_EVENT(xfs_buf_lock);
DEFINE_BUF_EVENT(xfs_buf_lock_done);
--
2.7.4


2018-11-28 03:53:40

by Allison Henderson

Subject: [PATCH v1 6/7] xfs: Rewrite retried read

If we had to try more than one mirror to get a successful
read, then write that buffer back to correct the bad mirror.

Signed-off-by: Allison Henderson <[email protected]>
---
fs/xfs/xfs_buf.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f102d01..81f6491 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -847,6 +847,14 @@ xfs_buf_read_map(

}
retry_done:
+
+ /*
+ * if we had to try more than one mirror to successfully read
+ * the buffer, write the buffer back
+ */
+ if (!bp->b_error && i > 0)
+ xfs_bwrite(bp);
+
return bp;
}

--
2.7.4


2018-11-28 03:53:51

by Allison Henderson

Subject: [PATCH v1 5/7] xfs: Add device retry

Check to see if _xfs_buf_read() fails. If so, loop over the
available mirrors and retry the read.

Signed-off-by: Allison Henderson <[email protected]>
---
fs/xfs/xfs_buf.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index dd8ba59..f102d01 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -21,6 +21,7 @@
#include <linux/migrate.h>
#include <linux/backing-dev.h>
#include <linux/freezer.h>
+#include <linux/blkdev.h>

#include "xfs_format.h"
#include "xfs_log_format.h"
@@ -808,6 +809,8 @@ xfs_buf_read_map(
const struct xfs_buf_ops *ops)
{
struct xfs_buf *bp;
+ struct request_queue *q;
+ unsigned short i;

flags |= XBF_READ;

@@ -820,7 +823,30 @@ xfs_buf_read_map(
if (!(bp->b_flags & XBF_DONE)) {
XFS_STATS_INC(target->bt_mount, xb_get_read);
bp->b_ops = ops;
- _xfs_buf_read(bp, flags);
+ q = bdev_get_queue(bp->b_target->bt_bdev);
+
+ /*
+ * Mirrors are indexed 1 - n, specified through the rw_hint.
+ * Setting the hint to 0 is unspecified and allows the block
+ * layer to decide.
+ */
+ for (i = 0; i <= blk_queue_get_mirrors(q); i++) {
+ bp->b_error = 0;
+ bp->b_rw_hint = i;
+ _xfs_buf_read(bp, flags);
+
+ switch (bp->b_error) {
+ case -EIO:
+ case -EFSCORRUPTED:
+ case -EFSBADCRC:
+ /* loop again */
+ continue;
+ default:
+ goto retry_done;
+ }
+
+ }
+retry_done:
return bp;
}

--
2.7.4


2018-11-28 03:54:07

by Allison Henderson

Subject: [PATCH v1 4/7] xfs: Add b_rw_hint to xfs_buf

This patch adds a new field b_rw_hint to xfs_buf. We will
need this to properly initialize the new bio->bi_rw_hint when
submitting the read request. When the read completes, we
then store the returned mirror in the b_rw_hint.
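
As a rough usage sketch (assuming the 1-based mirror indexing used later in
the series), the field acts as an in/out parameter around a read:

	bp->b_rw_hint = 2;	/* force the read onto mirror 1 */
	_xfs_buf_read(bp, flags);
	/* on completion, bp->b_rw_hint holds the mirror actually used */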

Signed-off-by: Allison Henderson <[email protected]>
---
fs/xfs/xfs_buf.c | 5 ++++-
fs/xfs/xfs_buf.h | 7 +++++++
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b21ea2b..dd8ba59 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1322,8 +1322,10 @@ xfs_buf_bio_end_io(
if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));

- if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
+ bp->b_rw_hint = bio->bi_rw_hint;
xfs_buf_ioend_async(bp);
+ }
bio_put(bio);
}

@@ -1369,6 +1371,7 @@ xfs_buf_ioapply_map(
bio->bi_iter.bi_sector = sector;
bio->bi_end_io = xfs_buf_bio_end_io;
bio->bi_private = bp;
+ bio->bi_rw_hint = bp->b_rw_hint;
bio_set_op_attrs(bio, op, op_flags);

for (; size && nr_pages; nr_pages--, page_index++) {
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b9f5511..db138e5 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -197,6 +197,13 @@ typedef struct xfs_buf {
unsigned long b_first_retry_time; /* in jiffies */
int b_last_error;

+ /*
+ * If b_rw_hint is set before a read, it specifies an alternate mirror
+ * to read from. Upon bio completion, b_rw_hint stores the last mirror
+ * that was read from
+ */
+ unsigned short b_rw_hint;
+
const struct xfs_buf_ops *b_ops;
} xfs_buf_t;

--
2.7.4


2018-11-28 03:54:42

by Allison Henderson

Subject: [PATCH v1 2/7] block: expand write_hint of bio/request to rw_hint

From: Bob Liu <[email protected]>

The write_hint field was expanded to rw_hint in order to support alternative
mirror device retry.

* Renaming @bi_write_hint in 'struct bio' to @bi_rw_hint, and @write_hint
in 'struct request' to @rw_hint.

* Making @bi_rw_hint only be updated for WRITE i/o. This wasn't a problem
before because READ didn't use this hint at all.

* Setting @bi_rw_hint to specify which mirror to read from by force.

* Recording which mirror the i/o really went to. Lower layers, e.g. the
md/raid1 driver, may have optimizations that spread i/o across different
copies; upper layers such as the fs have no idea which device/mirror the
data was read from, and so cannot start a retry.

Todo:
- Eat no more than 3-4 of the hint bits since most devices won't have more than
8-16 mirrors.
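
As a sketch, the resulting in/out contract on a read bio looks like this (the
end_io context is illustrative; the field name is from this patch):

	bio->bi_rw_hint = 0;	/* 0: let the driver pick a mirror */
	submit_bio(bio);

	/* later, in the bio's bi_end_io handler: */
	mirror = bio->bi_rw_hint;	/* mirror the driver actually used */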

Signed-off-by: Bob Liu <[email protected]>
---
Documentation/block/biodoc.txt | 7 +++++++
block/bio.c | 2 +-
block/blk-core.c | 10 +++++++++-
block/blk-merge.c | 8 ++++----
block/bounce.c | 2 +-
drivers/md/raid1.c | 2 +-
drivers/md/raid5.c | 10 +++++-----
drivers/md/raid5.h | 2 +-
drivers/nvme/host/core.c | 2 +-
fs/block_dev.c | 6 ++++--
fs/btrfs/extent_io.c | 3 ++-
fs/buffer.c | 3 ++-
fs/direct-io.c | 3 ++-
fs/ext4/page-io.c | 7 +++++--
fs/f2fs/data.c | 2 +-
fs/iomap.c | 3 ++-
fs/mpage.c | 2 +-
fs/xfs/xfs_aops.c | 4 ++--
include/linux/blk_types.h | 2 +-
include/linux/blkdev.h | 2 +-
20 files changed, 53 insertions(+), 29 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 207eca5..65cda9e 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -431,6 +431,7 @@ struct bio {
struct bio *bi_next; /* request queue link */
struct block_device *bi_bdev; /* target device */
unsigned long bi_flags; /* status, command, etc */
+ unsigned short bi_rw_hint; /* bio read/write hint */
unsigned long bi_opf; /* low bits: r/w, high: priority */

unsigned int bi_vcnt; /* how may bio_vec's */
@@ -465,6 +466,12 @@ With this multipage bio design:
(e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
[TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
bi_offset an len fields]
+- bi_rw_hint is an in/out parameter. The fs can set bi_rw_hint in submit_bio() to
+ specify which mirror/copy to read from by force. Zero is a special value
+ meaning the fs doesn't care which mirror/copy is read from. Values starting
+ from 1 mean a mandatory read from mirror 'bi_rw_hint - 1'.
+ On completion, bi_rw_hint is set to indicate which mirror the i/o
+ actually happened on.

(*) unrelated merges -- a request ends up containing two or more bios that
didn't originate from the same place.
diff --git a/block/bio.c b/block/bio.c
index d5368a4..25f1b22 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -605,7 +605,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
if (bio_flagged(bio_src, BIO_THROTTLED))
bio_set_flag(bio, BIO_THROTTLED);
bio->bi_opf = bio_src->bi_opf;
- bio->bi_write_hint = bio_src->bi_write_hint;
+ bio->bi_rw_hint = bio_src->bi_rw_hint;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;

diff --git a/block/blk-core.c b/block/blk-core.c
index 50779c8..e9f7080 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1980,7 +1980,7 @@ void blk_init_request_from_bio(struct request *req, struct bio *bio)
req->ioprio = ioc->ioprio;
else
req->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- req->write_hint = bio->bi_write_hint;
+ req->rw_hint = bio->bi_rw_hint;
blk_rq_bio_prep(req->q, req, bio);
}
EXPORT_SYMBOL_GPL(blk_init_request_from_bio);
@@ -2314,6 +2314,14 @@ generic_make_request_checks(struct bio *bio)
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
break;
+ /*
+ * Zero is a special value meaning the upper layer (e.g. the fs) doesn't
+ * care which mirror is read from.
+ * Values from 1 up mean a mandatory read from mirror 'bi_rw_hint - 1'.
+ */
+ case REQ_OP_READ:
+ if (bio->bi_rw_hint > q->nr_mirrors)
+ goto not_supported;
default:
break;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6b5ad27..e32e2d2 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -766,10 +766,10 @@ static struct request *attempt_merge(struct request_queue *q,
return NULL;

/*
- * Don't allow merge of different write hints, or for a hint with
+ * Don't allow merge of different rw hints, or for a hint with
* non-hint IO.
*/
- if (req->write_hint != next->write_hint)
+ if (req->rw_hint != next->rw_hint)
return NULL;

/*
@@ -904,10 +904,10 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
return false;

/*
- * Don't allow merge of different write hints, or for a hint with
+ * Don't allow merge of different rw hints, or for a hint with
* non-hint IO.
*/
- if (rq->write_hint != bio->bi_write_hint)
+ if (rq->rw_hint != bio->bi_rw_hint)
return false;

return true;
diff --git a/block/bounce.c b/block/bounce.c
index 36869af..a7b789e 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -248,7 +248,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
return NULL;
bio->bi_disk = bio_src->bi_disk;
bio->bi_opf = bio_src->bi_opf;
- bio->bi_write_hint = bio_src->bi_write_hint;
+ bio->bi_rw_hint = bio_src->bi_rw_hint;
bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1d54109..fedf8c0 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1102,7 +1102,7 @@ static void alloc_behind_master_bio(struct r1bio *r1_bio,
goto skip_copy;
}

- behind_bio->bi_write_hint = bio->bi_write_hint;
+ behind_bio->bi_rw_hint = bio->bi_rw_hint;

while (i < vcnt && size) {
struct page *page;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4990f03..37593a0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1137,9 +1137,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
bi->bi_io_vec[0].bv_offset = 0;
bi->bi_iter.bi_size = STRIPE_SIZE;
- bi->bi_write_hint = sh->dev[i].write_hint;
+ bi->bi_rw_hint = sh->dev[i].rw_hint;
if (!rrdev)
- sh->dev[i].write_hint = RWF_WRITE_LIFE_NOT_SET;
+ sh->dev[i].rw_hint = RWF_WRITE_LIFE_NOT_SET;
/*
* If this is discard request, set bi_vcnt 0. We don't
* want to confuse SCSI because SCSI will replace payload
@@ -1191,8 +1191,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
rbi->bi_io_vec[0].bv_offset = 0;
rbi->bi_iter.bi_size = STRIPE_SIZE;
- rbi->bi_write_hint = sh->dev[i].write_hint;
- sh->dev[i].write_hint = RWF_WRITE_LIFE_NOT_SET;
+ rbi->bi_rw_hint = sh->dev[i].rw_hint;
+ sh->dev[i].rw_hint = RWF_WRITE_LIFE_NOT_SET;
/*
* If this is discard request, set bi_vcnt 0. We don't
* want to confuse SCSI because SCSI will replace payload
@@ -3219,7 +3219,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
(unsigned long long)sh->sector);

spin_lock_irq(&sh->stripe_lock);
- sh->dev[dd_idx].write_hint = bi->bi_write_hint;
+ sh->dev[dd_idx].rw_hint = bi->bi_rw_hint;
/* Don't allow new IO added to stripes in batch list */
if (sh->batch_head)
goto overlap;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 8474c22..e9f0794 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -257,7 +257,7 @@ struct stripe_head {
sector_t sector; /* sector of this page */
unsigned long flags;
u32 log_checksum;
- unsigned short write_hint;
+ unsigned short rw_hint;
} dev[1]; /* allocated with extra space depending of RAID geometry */
};

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2e65be8..18f0824 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -516,7 +516,7 @@ static void nvme_assign_write_stream(struct nvme_ctrl *ctrl,
struct request *req, u16 *control,
u32 *dsmgmt)
{
- enum rw_hint streamid = req->write_hint;
+ enum rw_hint streamid = req->rw_hint;

if (streamid == WRITE_LIFE_NOT_SET || streamid == WRITE_LIFE_NONE)
streamid = 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index a80b4f0..cd6e154 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -214,7 +214,8 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
bio_init(&bio, vecs, nr_pages);
bio_set_dev(&bio, bdev);
bio.bi_iter.bi_sector = pos >> 9;
- bio.bi_write_hint = iocb->ki_hint;
+ if (iov_iter_rw(iter) == WRITE)
+ bio.bi_rw_hint = iocb->ki_hint;
bio.bi_private = current;
bio.bi_end_io = blkdev_bio_end_io_simple;
bio.bi_ioprio = iocb->ki_ioprio;
@@ -355,7 +356,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
for (;;) {
bio_set_dev(bio, bdev);
bio->bi_iter.bi_sector = pos >> 9;
- bio->bi_write_hint = iocb->ki_hint;
+ if (!is_read)
+ bio->bi_rw_hint = iocb->ki_hint;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
bio->bi_ioprio = iocb->ki_ioprio;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d228f70..3a9525e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2806,7 +2806,8 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
bio_add_page(bio, page, page_size, pg_offset);
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
- bio->bi_write_hint = page->mapping->host->i_write_hint;
+ if (opf & REQ_OP_WRITE)
+ bio->bi_rw_hint = page->mapping->host->i_write_hint;
bio->bi_opf = opf;
if (wbc) {
wbc_init_bio(wbc, bio);
diff --git a/fs/buffer.c b/fs/buffer.c
index 1286c2b..2959055 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3067,7 +3067,8 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,

bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio_set_dev(bio, bh->b_bdev);
- bio->bi_write_hint = write_hint;
+ if (REQ_OP_WRITE & op)
+ bio->bi_rw_hint = write_hint;

bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
BUG_ON(bio->bi_iter.bi_size != bh->b_size);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 722d17c..290b29e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -445,7 +445,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;

- bio->bi_write_hint = dio->iocb->ki_hint;
+ if (dio->op == REQ_OP_WRITE)
+ bio->bi_rw_hint = dio->iocb->ki_hint;

sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index db75901..8d63174 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -351,7 +351,9 @@ void ext4_io_submit(struct ext4_io_submit *io)
if (bio) {
int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
REQ_SYNC : 0;
- io->io_bio->bi_write_hint = io->io_end->inode->i_write_hint;
+ if (io->io_bio->bi_opf & REQ_OP_WRITE)
+ io->io_bio->bi_rw_hint =
+ io->io_end->inode->i_write_hint;
bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
submit_bio(io->io_bio);
}
@@ -399,7 +401,8 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
ret = io_submit_init_bio(io, bh);
if (ret)
return ret;
- io->io_bio->bi_write_hint = inode->i_write_hint;
+ if (io->io_bio->bi_opf & REQ_OP_WRITE)
+ io->io_bio->bi_rw_hint = inode->i_write_hint;
}
ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
if (ret != bh->b_size)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b293cb3..5f9afa2 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -269,7 +269,7 @@ static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, block_t blk_addr,
} else {
bio->bi_end_io = f2fs_write_end_io;
bio->bi_private = sbi;
- bio->bi_write_hint = f2fs_io_type_to_rw_hint(sbi, type, temp);
+ bio->bi_rw_hint = f2fs_io_type_to_rw_hint(sbi, type, temp);
}
if (wbc)
wbc_init_bio(wbc, bio);
diff --git a/fs/iomap.c b/fs/iomap.c
index 64ce240..8115475 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1637,7 +1637,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
bio = bio_alloc(GFP_KERNEL, nr_pages);
bio_set_dev(bio, iomap->bdev);
bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
- bio->bi_write_hint = dio->iocb->ki_hint;
+ if (dio->flags & IOMAP_DIO_WRITE)
+ bio->bi_rw_hint = dio->iocb->ki_hint;
bio->bi_ioprio = dio->iocb->ki_ioprio;
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/fs/mpage.c b/fs/mpage.c
index c820dc9..fd70ba7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -639,7 +639,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
goto confused;

wbc_init_bio(wbc, bio);
- bio->bi_write_hint = inode->i_write_hint;
+ bio->bi_rw_hint = inode->i_write_hint;
}

/*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9..6dafcec 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -523,7 +523,7 @@ xfs_submit_ioend(
return status;
}

- ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
+ ioend->io_bio->bi_rw_hint = ioend->io_inode->i_write_hint;
submit_bio(ioend->io_bio);
return 0;
}
@@ -577,7 +577,7 @@ xfs_chain_bio(
bio_chain(ioend->io_bio, new);
bio_get(ioend->io_bio); /* for xfs_destroy_ioend */
ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
- ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
+ ioend->io_bio->bi_rw_hint = ioend->io_inode->i_write_hint;
submit_bio(ioend->io_bio);
ioend->io_bio = new;
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1dcf652..612e8a6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -150,7 +150,7 @@ struct bio {
*/
unsigned short bi_flags; /* status, etc and bvec pool number */
unsigned short bi_ioprio;
- unsigned short bi_write_hint;
+ unsigned short bi_rw_hint;
blk_status_t bi_status;
u8 bi_partno;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fac35da..02179af 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -234,7 +234,7 @@ struct request {
unsigned short nr_integrity_segments;
#endif

- unsigned short write_hint;
+ unsigned short rw_hint;
unsigned short ioprio;

void *special; /* opaque pointer available for LLD use */
--
2.7.4


2018-11-28 03:55:29

by Allison Henderson

Subject: [PATCH v1 1/7] block: add nr_mirrors to request_queue

From: Bob Liu <[email protected]>

When fs data/metadata checksums mismatch, lower block devices may have other
correct copies, e.g. if we use raid1 to protect fs metadata. The fs could then
try other copies of the metadata instead of panicking, but the fs needs to be
aware of how many mirrors the block device has, and to specify which
mirror/copy to retry.

This patch adds @nr_mirrors to struct request_queue, similar to
blk_queue_nonrot(); a filesystem can grab the device request queue and check
the number of mirrors of this block device.

@nr_mirrors is 0 by default, which means no mirrors; drivers, e.g. raid1, are
responsible for setting the right value.

Helper functions were also added to get/set the number of mirrors for a
specific device request queue.

Todo:
* Export nr_mirrors through sysfs.
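
As a usage sketch, a mirroring driver publishes its redundancy at setup time
and a filesystem queries it before deciding whether a retry can help (the
raid1 call matches patch 3; the fs-side check is illustrative):

	/* driver side, e.g. raid1 at array start: */
	blk_queue_set_mirrors(mddev->queue, mddev->raid_disks);

	/* fs side: */
	q = bdev_get_queue(bdev);
	if (blk_queue_get_mirrors(q) > 0) {
		/* alternate copies exist, retries are worth attempting */
	}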

Signed-off-by: Bob Liu <[email protected]>
---
block/blk-core.c | 3 +++
block/blk-settings.c | 18 ++++++++++++++++++
include/linux/blkdev.h | 3 +++
3 files changed, 24 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index ce12515f..50779c8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1082,6 +1082,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
if (blkcg_init_queue(q))
goto fail_ref;

+ /* Set queue default mirrors to 0 explicitly. */
+ blk_queue_set_mirrors(q, 0);
+
return q;

fail_ref:
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 696c04c..6f4f9c7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -904,6 +904,24 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
}
EXPORT_SYMBOL_GPL(blk_queue_write_cache);

+/*
+ * Get the number of read redundant mirrors.
+ */
+unsigned short blk_queue_get_mirrors(struct request_queue *q)
+{
+ return q->nr_mirrors;
+}
+EXPORT_SYMBOL(blk_queue_get_mirrors);
+
+/*
+ * Set the number of read redundant mirrors.
+ */
+void blk_queue_set_mirrors(struct request_queue *q, unsigned short mirrors)
+{
+ q->nr_mirrors = mirrors;
+}
+EXPORT_SYMBOL(blk_queue_set_mirrors);
+
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4293dc1..fac35da 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -680,6 +680,7 @@ struct request_queue {

#define BLK_MAX_WRITE_HINTS 5
u64 write_hints[BLK_MAX_WRITE_HINTS];
+ unsigned short nr_mirrors; /* Default value is zero */
};

#define QUEUE_FLAG_QUEUED 0 /* uses generic tag queueing */
@@ -1267,6 +1268,8 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
+extern unsigned short blk_queue_get_mirrors(struct request_queue *q);
+extern void blk_queue_set_mirrors(struct request_queue *q, unsigned short mirrors);

/*
* Number of physical segments as sent to the device.
--
2.7.4


2018-11-28 03:56:04

by Allison Henderson

Subject: [PATCH v1 3/7] md: raid1: handle bi_rw_hint accordingly

From: Bob Liu <[email protected]>

* The nr_mirrors that a raid1 device supports should be @raid_disks; init it
properly.
* Record which mirror the i/o went to in bio->bi_rw_hint.
* Read from a specific real device if bi_rw_hint was set.

Todo:
* Support more drivers.
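
For a two-disk raid1 the resulting hint space is, in effect (sketch;
replacement devices at index >= raid_disks are folded back onto the original
slot, see read_balance() below):

	bi_rw_hint == 0: read_balance() picks any suitable leg
	bi_rw_hint == 1: force the read onto rdev 0
	bi_rw_hint == 2: force the read onto rdev 1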

Signed-off-by: Bob Liu <[email protected]>
---
drivers/md/raid1.c | 31 ++++++++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fedf8c0..d2bdd0e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -553,7 +553,8 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector,
*
* The rdev for the device selected will have nr_pending incremented.
*/
-static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors)
+static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sectors,
+ unsigned short disk_hint)
{
const sector_t this_sector = r1_bio->sector;
int sectors;
@@ -566,6 +567,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
struct md_rdev *rdev;
int choose_first;
int choose_next_idle;
+ int max_disks;

rcu_read_lock();
/*
@@ -593,7 +595,20 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
else
choose_first = 0;

- for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ if (disk_hint) {
+ disk = disk_hint - 1;
+ /*
+ * Consider replacement devices as a special case: use the original
+ * device number to indicate which mirror this i/o happened on.
+ */
+ if (disk >= conf->raid_disks)
+ disk -= conf->raid_disks;
+ max_disks = disk + 1;
+ } else {
+ disk = 0;
+ max_disks = conf->raid_disks * 2;
+ }
+ for (; disk < max_disks; disk++) {
sector_t dist;
sector_t first_bad;
int bad_sectors;
@@ -1234,7 +1249,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
* make_request() can abort the operation when read-ahead is being
* used and no empty request is available.
*/
- rdisk = read_balance(conf, r1_bio, &max_sectors);
+ rdisk = read_balance(conf, r1_bio, &max_sectors, bio->bi_rw_hint);

if (rdisk < 0) {
/* couldn't find anywhere to read from */
@@ -1247,6 +1262,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
raid_end_bio_io(r1_bio);
return;
}
+
+ /* Record which real device the i/o went to. */
+ bio->bi_rw_hint = rdisk;
+
mirror = conf->mirrors + rdisk;

if (print_msg)
@@ -1279,6 +1298,11 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
r1_bio->read_disk = rdisk;

read_bio = bio_clone_fast(bio, gfp, &mddev->bio_set);
+ /*
+ * Clear bi_rw_hint, because it was set to record which real device
+ * the last i/o went to.
+ */
+ read_bio->bi_rw_hint = 0;

r1_bio->bios[rdisk] = read_bio;

@@ -3078,6 +3102,7 @@ static int raid1_run(struct mddev *mddev)
if (mddev->queue) {
blk_queue_max_write_same_sectors(mddev->queue, 0);
blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
+ blk_queue_set_mirrors(mddev->queue, mddev->raid_disks);
}

rdev_for_each(rdev, mddev) {
--
2.7.4


2018-11-28 05:05:34

by Dave Chinner

Subject: Re: [PATCH v1 4/7] xfs: Add b_rw_hint to xfs_buf

On Tue, Nov 27, 2018 at 08:49:48PM -0700, Allison Henderson wrote:
> This patch adds a new field b_rw_hint to xfs_buf. We will
> need this to properly initialize the new bio->bi_rw_hint when
> submitting the read request. When the read completes, we
> then store the returned mirror in the b_rw_hint.
>
> Signed-off-by: Allison Henderson <[email protected]>
> ---
> fs/xfs/xfs_buf.c | 5 ++++-
> fs/xfs/xfs_buf.h | 7 +++++++
> 2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index b21ea2b..dd8ba59 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1322,8 +1322,10 @@ xfs_buf_bio_end_io(
> if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
> invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
>
> - if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
> + if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
> + bp->b_rw_hint = bio->bi_rw_hint;
> xfs_buf_ioend_async(bp);
> + }
> bio_put(bio);
> }
>

This will miss setting bp->b_rw_hint for IO that completes before
submission returns to __xfs_buf_submit() (i.e. b_io_remaining is 2
at IO completion).

So I suspect it won't do the right thing on fast or synchronous
block devices like pmem. You should be able to test this with a RAID1
made from two ramdisks...
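
(A minimal sketch of one way around that would be to capture the hint
unconditionally before dropping the remaining count, assuming the last
completing bio's hint is the one we want:

	bp->b_rw_hint = bio->bi_rw_hint;	/* before the final dec */
	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
		xfs_buf_ioend_async(bp);

though multi-bio buffers would still need their hints reconciled.)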

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-11-28 05:10:04

by Dave Chinner

Subject: Re: [PATCH v1 5/7] xfs: Add device retry

On Tue, Nov 27, 2018 at 08:49:49PM -0700, Allison Henderson wrote:
> Check to see if _xfs_buf_read() fails. If so, loop over the
> available mirrors and retry the read.
>
> Signed-off-by: Allison Henderson <[email protected]>
> ---
> fs/xfs/xfs_buf.c | 28 +++++++++++++++++++++++++++-
> 1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index dd8ba59..f102d01 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -21,6 +21,7 @@
> #include <linux/migrate.h>
> #include <linux/backing-dev.h>
> #include <linux/freezer.h>
> +#include <linux/blkdev.h>
>
> #include "xfs_format.h"
> #include "xfs_log_format.h"
> @@ -808,6 +809,8 @@ xfs_buf_read_map(
> const struct xfs_buf_ops *ops)
> {
> struct xfs_buf *bp;
> + struct request_queue *q;
> + unsigned short i;
>
> flags |= XBF_READ;
>
> @@ -820,7 +823,30 @@ xfs_buf_read_map(
> if (!(bp->b_flags & XBF_DONE)) {
> XFS_STATS_INC(target->bt_mount, xb_get_read);
> bp->b_ops = ops;
> - _xfs_buf_read(bp, flags);
> + q = bdev_get_queue(bp->b_target->bt_bdev);
> +
> + /*
> + * Mirrors are indexed 1 - n, specified through the rw_hint.
> + * Setting the hint to 0 is unspecified and allows the block
> + * layer to decide.
> + */
> + for (i = 0; i <= blk_queue_get_mirrors(q); i++) {
> + bp->b_error = 0;
> + bp->b_rw_hint = i;
> + _xfs_buf_read(bp, flags);

So the first time through this loop the block layer decides which
device to read from, then we iterate devices 1..n on error.

Which means if device 0 is the only one with good information in it,
we may not ever actually read from it.

I'd suggest that a hint of "-1" (or equivalent max value) should be
used for "device selects mirror leg" rather than 0, so we can
actually read from the first device on command.

i.e.
bp->b_error = 0;
bp->b_rw_hint = -1;
_xfs_buf_read(bp, flags);

if (!bp->b_error)
return bp;

/* manual iteration to find a good copy */
for (i = 0; i <= blk_queue_get_mirrors(q); i++) {
bp->b_error = 0;
bp->b_rw_hint = i;
_xfs_buf_read(bp, flags);
......
> +
> + switch (bp->b_error) {
> + case -EIO:
> + case -EFSCORRUPTED:
> + case -EFSBADCRC:
> + /* loop again */
> + continue;
> + default:
> + goto retry_done;

Just return bp here, don't need a jump label for it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-11-28 05:21:51

by Dave Chinner

Subject: Re: [PATCH v1 6/7] xfs: Rewrite retried read

On Tue, Nov 27, 2018 at 08:49:50PM -0700, Allison Henderson wrote:
> If we had to try more than one mirror to get a successful
> read, then write that buffer back to correct the bad mirror.
>
> Signed-off-by: Allison Henderson <[email protected]>
> ---
> fs/xfs/xfs_buf.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f102d01..81f6491 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -847,6 +847,14 @@ xfs_buf_read_map(
>
> }
> retry_done:
> +
> + /*
> + * if we had to try more than one mirror to successfully read
> + * the buffer, write the buffer back
> + */
> + if (!bp->b_error && i > 0)
> + xfs_bwrite(bp);
> +

This can go in the case statement on retry and then you don't need
to check for i > 0 or, well, bp->b_error. i.e.

switch (bp->b_error) {
case -EFSBADCRC:
case -EIO:
case -EFSCORRUPTED:
/* try again from different copy */
continue;
case 0:
/* good copy, rewrite it to repair bad copy */
xfs_bwrite(bp);
/* fallthrough */
default:
return bp;
}

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-11-28 05:25:06

by Darrick J. Wong

Subject: Re: [PATCH v1 5/7] xfs: Add device retry

On Wed, Nov 28, 2018 at 04:08:50PM +1100, Dave Chinner wrote:
> On Tue, Nov 27, 2018 at 08:49:49PM -0700, Allison Henderson wrote:
> > Check to see if _xfs_buf_read() fails. If so, loop over the
> > available mirrors and retry the read.
> >
> > Signed-off-by: Allison Henderson <[email protected]>
> > ---
> > fs/xfs/xfs_buf.c | 28 +++++++++++++++++++++++++++-
> > 1 file changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index dd8ba59..f102d01 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -21,6 +21,7 @@
> > #include <linux/migrate.h>
> > #include <linux/backing-dev.h>
> > #include <linux/freezer.h>
> > +#include <linux/blkdev.h>
> >
> > #include "xfs_format.h"
> > #include "xfs_log_format.h"
> > @@ -808,6 +809,8 @@ xfs_buf_read_map(
> > const struct xfs_buf_ops *ops)
> > {
> > struct xfs_buf *bp;
> > + struct request_queue *q;
> > + unsigned short i;
> >
> > flags |= XBF_READ;
> >
> > @@ -820,7 +823,30 @@ xfs_buf_read_map(
> > if (!(bp->b_flags & XBF_DONE)) {
> > XFS_STATS_INC(target->bt_mount, xb_get_read);
> > bp->b_ops = ops;
> > - _xfs_buf_read(bp, flags);
> > + q = bdev_get_queue(bp->b_target->bt_bdev);
> > +
> > + /*
> > + * Mirrors are indexed 1 - n, specified through the rw_hint.
> > + * Setting the hint to 0 is unspecified and allows the block
> > + * layer to decide.
> > + */
> > + for (i = 0; i <= blk_queue_get_mirrors(q); i++) {
> > + bp->b_error = 0;
> > + bp->b_rw_hint = i;
> > + _xfs_buf_read(bp, flags);
>
> So the first time through this loop the block layer decides which
> device to read from, then we iterate devices 1..n on error.
>
> Which means if device 0 is the only one with good information in it,
> we may not ever actually read from it.
>
> I'd suggest that a hint of "-1" (or equivalent max value) should be
> used for "device selects mirror leg" rather than 0, so we can
> actually read from the first device on command.

"read from the first device on command" => "set bio.bi_rw_hint = 1"...

> i.e.
> bp->b_error = 0;
> bp->b_rw_hint = -1;

...which is confusing. The intended behavior for this RFC (though not
so well documented) is that bi_rw_hint == 0 means "let the device
choose", and rw_hint > 1 means "choose mirror (rw_hint - 1)". That's
sort of an odd behavior because now we have:

blk_queue_get_mirrors(q) returns 5 (as in 5 mirrors) but we access the
5 mirrors as indices 1-5, not 0-4 like most programmers would probably
expect.

Also, I think it's probably necessary to create a #define to attach a
name to the "let the device choose" value...

#define BIO_RW_HINT_ANY_MIRROR (0)

for (i = BIO_RW_HINT_ANY_MIRROR; i <= blk_queue_get_mirrors(q); i++) {
...
bp->b_rw_hint = i;
...
_xfs_buf_read(bp, flags);
...
}

(or offset things -1 like you propose)

--D

> _xfs_buf_read(bp, flags);
>
> if (!bp->b_error)
> return bp;
>
> /* manual iteration to find a good copy */
> for (i = 0; i <= blk_queue_get_mirrors(q); i++) {
> bp->b_error = 0;
> bp->b_rw_hint = i;
> _xfs_buf_read(bp, flags);
> ......
> > +
> > + switch (bp->b_error) {
> > + case -EIO:
> > + case -EFSCORRUPTED:
> > + case -EFSBADCRC:
> > + /* loop again */
> > + continue;
> > + default:
> > + goto retry_done;
>
> Just return bp here, don't need a jump label for it.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2018-11-28 05:27:06

by Darrick J. Wong

Subject: Re: [PATCH v1 6/7] xfs: Rewrite retried read

On Wed, Nov 28, 2018 at 04:17:19PM +1100, Dave Chinner wrote:
> On Tue, Nov 27, 2018 at 08:49:50PM -0700, Allison Henderson wrote:
> > If we had to try more than one mirror to get a successful
> > read, then write that buffer back to correct the bad mirror.
> >
> > Signed-off-by: Allison Henderson <[email protected]>
> > ---
> > fs/xfs/xfs_buf.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index f102d01..81f6491 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -847,6 +847,14 @@ xfs_buf_read_map(
> >
> > }
> > retry_done:
> > +
> > + /*
> > + * if we had to try more than one mirror to successfully read
> > + * the buffer, write the buffer back
> > + */
> > + if (!bp->b_error && i > 0)
> > + xfs_bwrite(bp);
> > +
>
> This can go in the case statement on retry and then you don't need
> to check for i > 0 or, well, bp->b_error. i.e.
>
> switch (bp->b_error) {
> case -EFSBADCRC:
> case -EIO:
> case -EFSCORRUPTED:
> /* try again from different copy */
> continue;
> case 0:
> /* good copy, rewrite it to repair bad copy */
> xfs_bwrite(bp);

Some day we might want to provide some controls for how long we'll retry
these reads and whether or not we automatically rewrite buffers, since
some administrators might prefer fast fail to get failover started.

(Not now though)

--D

> /* fallthrough */
> default:
> return bp;
> }
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2018-11-28 05:33:58

by Dave Chinner

Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
> Motivation:
> When fs data/metadata checksums mismatch, lower block devices may have other
> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> decides that the metadata is garbage, today it will shut down the entire
> filesystem without trying any of the other mirrors. This is a severe
> loss of service, and we propose these patches to have XFS try harder to
> avoid failure.
>
> This patch set prototypes the mirror retry idea by:
> * Adding @nr_mirrors to struct request_queue, similar to blk_queue_nonrot();
> a filesystem can grab the device request queue and check the maximum number
> of mirrors the block device has.
> Helper functions were also added to get/set nr_mirrors.
>
> * Expanding bi_write_hint to bi_rw_hint; @bi_rw_hint now has three meanings:
> 1. The original write_hint.
> 2. end_io() updates @bi_rw_hint to reflect which mirror the i/o actually happened on.
> 3. The fs sets @bi_rw_hint to force the driver, e.g. raid1, to read from a specific mirror.
>
> * Modify md/raid1 to support this retry feature.
>
> * Add b_rw_hint to xfs_buf
> This patch adds a new field b_rw_hint to xfs_buf. We will use this to set the
> new bio->bi_rw_hint when submitting the read request, and also to store the
> returned mirror when the read completes.

One thing that is going to make this more complex at the XFS layer
is discontiguous buffers. They require multiple IOs (and therefore
bios) and so we are going to need to ensure that all the bios use
the same bi_rw_hint.

This is another reason I suggest that bi_rw_hint has a magic value
for "block layer selects mirror" and separate the initial read from
the retry iterations. That allows us to let the block layer pick
whatever leg it wants for the initial read, but if we get a failure
we directly control the mirror we retry from and all bios in the
buffer go to that same mirror.

> We're not planning to take over all 16 bits of the read hint field; just looking for
> feedback about the sanity of the overall approach.

It seems conceptually simple enough - the biggest questions I have
are:

- how does propagation through stacked layers work?
- is it generic/abstract enough to be able to work with
RAID5/6 to trigger verification/recovery from the parity
information in the stripe?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-11-28 05:39:19

by Dave Chinner

Subject: Re: [PATCH v1 5/7] xfs: Add device retry

On Tue, Nov 27, 2018 at 09:22:45PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 28, 2018 at 04:08:50PM +1100, Dave Chinner wrote:
> > On Tue, Nov 27, 2018 at 08:49:49PM -0700, Allison Henderson wrote:
> > > Check to see if _xfs_buf_read() fails. If so, loop over the
> > > available mirrors and retry the read.
> > >
> > > Signed-off-by: Allison Henderson <[email protected]>
> > > ---
> > > fs/xfs/xfs_buf.c | 28 +++++++++++++++++++++++++++-
> > > 1 file changed, 27 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index dd8ba59..f102d01 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -21,6 +21,7 @@
> > > #include <linux/migrate.h>
> > > #include <linux/backing-dev.h>
> > > #include <linux/freezer.h>
> > > +#include <linux/blkdev.h>
> > >
> > > #include "xfs_format.h"
> > > #include "xfs_log_format.h"
> > > @@ -808,6 +809,8 @@ xfs_buf_read_map(
> > > const struct xfs_buf_ops *ops)
> > > {
> > > struct xfs_buf *bp;
> > > + struct request_queue *q;
> > > + unsigned short i;
> > >
> > > flags |= XBF_READ;
> > >
> > > @@ -820,7 +823,30 @@ xfs_buf_read_map(
> > > if (!(bp->b_flags & XBF_DONE)) {
> > > XFS_STATS_INC(target->bt_mount, xb_get_read);
> > > bp->b_ops = ops;
> > > - _xfs_buf_read(bp, flags);
> > > + q = bdev_get_queue(bp->b_target->bt_bdev);
> > > +
> > > + /*
> > > + * Mirrors are indexed 1 - n, specified through the rw_hint.
> > > + * Setting the hint to 0 is unspecified and allows the block
> > > + * layer to decide.
> > > + */
> > > + for (i = 0; i <= blk_queue_get_mirrors(q); i++) {
> > > + bp->b_error = 0;
> > > + bp->b_rw_hint = i;
> > > + _xfs_buf_read(bp, flags);
> >
> > So the first time through this loop the block layer decides which
> > device to read from, then we iterate devices 1..n on error.
> >
> > Which means if device 0 is the only one with good information in it,
> > we may not ever actually read from it.
> >
> > I'd suggest that a hint of "-1" (or equivalent max value) should be
> > used for "device selects mirror leg" rather than 0, so we can
> > actually read from the first device on command.
>
> "read from the first device on command" => "set bio.bi_rw_hint = 1"...

Landmine.

> > i.e.
> > bp->b_error = 0;
> > bp->b_rw_hint = -1;
>
> ...which is confusing. The intended behavior for this RFC (though not
> so well documented) is that bi_rw_hint == 0 means "let the device
> choose", and rw_hint > 1 means "choose mirror (rw_hint - 1)". That's
> sort of an odd behavior because now we have:
>
> blk_queue_get_mirrors(q) returns 5 (as in 5 mirrors) but we access the
> 5 mirrors as indices 1-5, not 0-4 like most programmers would probably
> expect.

Yeah, that's not nice, and will lead to bugs in future as it trips
up people who have forgotten about this quirk.

> Also, I think it's probably necessary to create a #define to attach a
> name to the "let the device choose" value...
>
> #define BIO_RW_HINT_ANY_MIRROR (0)
>
> for (i = BIO_RW_HINT_ANY_MIRROR; i <= blk_queue_get_mirrors(q); i++) {
> ...
> bp->b_rw_hint = i;
> ...
> _xfs_buf_read(bp, flags);
> ...
> }

The recovery algorithms are only going to get more complex as
time goes on, so I'd really like to see an explicit separation of
the simple, unchanging fast path and the fallback recovery code.

Cheers,

dave.
--
Dave Chinner
[email protected]

2018-11-28 05:41:14

by Dave Chinner

Subject: Re: [PATCH v1 6/7] xfs: Rewrite retried read

On Tue, Nov 27, 2018 at 09:26:04PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 28, 2018 at 04:17:19PM +1100, Dave Chinner wrote:
> > On Tue, Nov 27, 2018 at 08:49:50PM -0700, Allison Henderson wrote:
> > > If we had to try more than one mirror to get a successful
> > > read, then write that buffer back to correct the bad mirror.
> > >
> > > Signed-off-by: Allison Henderson <[email protected]>
> > > ---
> > > fs/xfs/xfs_buf.c | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index f102d01..81f6491 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -847,6 +847,14 @@ xfs_buf_read_map(
> > >
> > > }
> > > retry_done:
> > > +
> > > + /*
> > > + * if we had to try more than one mirror to successfully read
> > > + * the buffer, write the buffer back
> > > + */
> > > + if (!bp->b_error && i > 0)
> > > + xfs_bwrite(bp);
> > > +
> >
> > This can go in the case statement on retry and then you don't need
> > to check for i > 0 or, well, bp->b_error. i.e.
> >
> > switch (bp->b_error) {
> > case -EFSBADCRC:
> > case -EIO:
> > case -EFSCORRUPTED:
> > /* try again from different copy */
> > continue;
> > case 0:
> > /* good copy, rewrite it to repair bad copy */
> > xfs_bwrite(bp);
>
> Some day we might want to provide some controls for how long we'll retry
> these reads and whether or not we automatically rewrite buffers, since
> some administrators might prefer fast fail to get failover started.

Sure, but if the recovery code is strewn all through the read code,
it becomes a mess to untangle. Isolate the recovery code as much as
possible; that way we can factor it out as it becomes more complex.

> (Not now though)

Which is exactly my point about future recovery complexity.... :P

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-11-28 05:50:43

by Darrick J. Wong

Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
> > Motivation:
> > When fs data/metadata checksums mismatch, lower block devices may have other
> > correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> > decides that the metadata is garbage, today it will shut down the entire
> > filesystem without trying any of the other mirrors. This is a severe
> > loss of service, and we propose these patches to have XFS try harder to
> > avoid failure.
> >
> > This patch set prototypes the mirror retry idea by:
> > * Adding @nr_mirrors to struct request_queue, similar to blk_queue_nonrot();
> > a filesystem can grab the device request queue and check the maximum number
> > of mirrors the block device has.
> > Helper functions were also added to get/set nr_mirrors.
> >
> > * Expanding bi_write_hint to bi_rw_hint; @bi_rw_hint now has three meanings:
> > 1. The original write_hint.
> > 2. end_io() updates @bi_rw_hint to reflect which mirror the i/o actually happened on.
> > 3. The fs sets @bi_rw_hint to force the driver, e.g. raid1, to read from a specific mirror.
> >
> > * Modify md/raid1 to support this retry feature.
> >
> > * Add b_rw_hint to xfs_buf
> > This patch adds a new field b_rw_hint to xfs_buf. We will use this to set the
> > new bio->bi_rw_hint when submitting the read request, and also to store the
> > returned mirror when the read completes.
>
> One thing that is going to make this more complex at the XFS layer
> is discontiguous buffers. They require multiple IOs (and therefore
> bios) and so we are going to need to ensure that all the bios use
> the same bi_rw_hint.

Hmm, we hadn't thought about that. What happens if we have a
discontiguous buffer mapped to multiple blocks, and there's only one
good copy of each block on separate disks in the whole array?

e.g. we have 8k directory blocks on a 4k block filesystem, only disk 0
has a good copy of block 0 and only disk 1 has a good copy of block 1?

I think we're just stuck with failing the whole thing because we can't
check the halves of the 8k block independently and there's too much of a
combinatoric explosion potential to try to mix and match.

> This is another reason I suggest that bi_rw_hint has a magic value
> for "block layer selects mirror" and separate the initial read from

(As mentioned in a previous reply of mine, setting rw_hint == 0 is the
magic value for "device picks mirror"...)

> the retry iterations. That allows us to let the block layer pick
> whatever leg it wants for the initial read, but if we get a failure
> we directly control the mirror we retry from and all bios in the
> buffer go to that same mirror.
>
> > We're not planning to take over all 16 bits of the read hint field; just looking for
> > feedback about the sanity of the overall approach.
>
> It seems conceptually simple enough - the biggest questions I have
> are:
>
> - how does propagation through stacked layers work?

Right now it doesn't, though once we work out how to make stacking work
through device mapper it could; my guess is that simple dm targets like
linear and crypt can set the mirror count to the min of all underlying
devices.

> - is it generic/abstract enough to be able to work with
> RAID5/6 to trigger verification/recovery from the parity
> information in the stripe?

In theory we could supply a raid5 implementation, wherein rw_hint == 0
lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
rw_hint == 2 forces stripe recovery for the given block.

A trickier scenario that I have no idea how to solve is the question of
how to handle dynamic redundancy levels. We don't have a standard bio
error value that means "this mirror is temporarily offline", so if you
have a raid1 of two disks and disk 0 goes offline, the retry loop in xfs
will hit the EIO and abort without even asking disk 1. It's also
unclear if we need to designate a second bio error value to mean "this
mirror is permanently gone".

[Also insert handwaving about whether or not online fsck will want to
control retries and automatic rewrite; I suspect the answer is that it
doesn't care.]

[[Also insert severe handwaving about do we expose this to userspace so
that xfs_repair can use it?]]

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2018-11-28 06:32:17

by Dave Chinner

Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Tue, Nov 27, 2018 at 09:49:23PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> > On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
> > > Motivation:
> > > When fs data/metadata checksums mismatch, lower block devices may have other
> > > correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> > > decides that the metadata is garbage, today it will shut down the entire
> > > filesystem without trying any of the other mirrors. This is a severe
> > > loss of service, and we propose these patches to have XFS try harder to
> > > avoid failure.
> > >
> > > This patch set prototypes the mirror retry idea by:
> > > * Adding @nr_mirrors to struct request_queue, similar to blk_queue_nonrot();
> > > a filesystem can grab the device request queue and check the maximum number
> > > of mirrors the block device has.
> > > Helper functions were also added to get/set nr_mirrors.
> > >
> > > * Expanding bi_write_hint to bi_rw_hint; @bi_rw_hint now has three meanings:
> > > 1. The original write_hint.
> > > 2. end_io() updates @bi_rw_hint to reflect which mirror the i/o actually happened on.
> > > 3. The fs sets @bi_rw_hint to force the driver, e.g. raid1, to read from a specific mirror.
> > >
> > > * Modify md/raid1 to support this retry feature.
> > >
> > > * Add b_rw_hint to xfs_buf
> > > This patch adds a new field b_rw_hint to xfs_buf. We will use this to set the
> > > new bio->bi_rw_hint when submitting the read request, and also to store the
> > > returned mirror when the read completes.
> >
> > One thing that is going to make this more complex at the XFS layer
> > is discontiguous buffers. They require multiple IOs (and therefore
> > bios) and so we are going to need to ensure that all the bios use
> > the same bi_rw_hint.
>
> Hmm, we hadn't thought about that. What happens if we have a
> discontiguous buffer mapped to multiple blocks, and there's only one
> good copy of each block on separate disks in the whole array?
>
> e.g. we have 8k directory blocks on a 4k block filesystem, only disk 0
> has a good copy of block 0 and only disk 1 has a good copy of block 1?

Then the user has a disaster on their hands because they have
multiple failing disks.

> I think we're just stuck with failing the whole thing because we can't
> check the halves of the 8k block independently and there's too much of a
> combinatoric explosion potential to try to mix and match.

Yup, user needs to fix their storage before the filesystem can
attempt recovery.

> > > We're not planning to take over all 16 bits of the read hint field; just looking for
> > > feedback about the sanity of the overall approach.
> >
> > It seems conceptually simple enough - the biggest questions I have
> > are:
> >
> > - how does propagation through stacked layers work?
>
> Right now it doesn't, though once we work out how to make stacking work
> through device mapper it could; my guess is that simple dm targets like
> linear and crypt can set the mirror count to the min of all underlying
> devices.
>
> > - is it generic/abstract enough to be able to work with
> > RAID5/6 to trigger verification/recovery from the parity
> > information in the stripe?
>
> In theory we could supply a raid5 implementation, wherein rw_hint == 0
> lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
> rw_hint == 2 forces stripe recovery for the given block.

So more magic numbers to define complex behaviours? :P

> A trickier scenario that I have no idea how to solve is the question of
> how to handle dynamic redundancy levels. We don't have a standard bio
> error value that means "this mirror is temporarily offline", so if you

We can get ETIMEDOUT, ENOLINK, EBUSY and EAGAIN from the block layer
which all indicate temporary errors (see blk_errors[]). Whether the
specific storage layers are actually using them is another matter...

> have a raid1 of two disks and disk 0 goes offline, the retry loop in xfs
> will hit the EIO and abort without even asking disk 1. It's also
> unclear if we need to designate a second bio error value to mean "this
> mirror is permanently gone".

If we have mirror-based retries, we should probably consider EIO
as "try next mirror", not as a hard failure.
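
A minimal sketch of what that policy could look like at the XFS end;
the loop shape, xfs_buf_read_once() and xfs_buf_err_retryable() are
made up for illustration, and only the errno values come from
blk_errors[]:

	/* Illustrative only: which errors mean "try the next mirror"? */
	static bool xfs_buf_err_retryable(int error)
	{
		switch (error) {
		case -EIO:		/* this leg failed; another may be fine */
		case -ETIMEDOUT:	/* BLK_STS_TIMEOUT */
		case -ENOLINK:		/* BLK_STS_TRANSPORT */
		case -EBUSY:		/* BLK_STS_DEV_RESOURCE */
		case -EAGAIN:		/* BLK_STS_AGAIN */
			return true;
		default:
			return false;	/* hard failure, stop retrying */
		}
	}

	/* hypothetical caller in xfs_buf_read_map() terms */
	static int xfs_buf_read_mirrors(struct xfs_buf *bp,
					struct request_queue *q)
	{
		int hint, error = -EIO;

		for (hint = 0; hint < blk_queue_get_mirrors(q); hint++) {
			bp->b_rw_hint = hint;
			error = xfs_buf_read_once(bp);	/* made-up helper */
			if (!error)
				break;		/* read + verifier passed */
			if (!xfs_buf_err_retryable(error))
				break;		/* give up immediately */
		}
		return error;
	}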

> [Also insert handwaving about whether or not online fsck will want to
> control retries and automatic rewrite; I suspect the answer is that it
> doesn't care.]

Don't care - have the storage fix itself, then check what comes
back and fix it from there.

> [[Also insert severe handwaving about do we expose this to userspace so
> that xfs_repair can use it?]]

I suspect the answer there is through the AIO interfaces....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-11-28 07:16:06

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Wed, Nov 28, 2018 at 05:30:46PM +1100, Dave Chinner wrote:
> On Tue, Nov 27, 2018 at 09:49:23PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> > > On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
> > > > Motivation:
> > > > When fs data/metadata checksums mismatch, lower block devices may have other
> > > > correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> > > > decides that the metadata is garbage, today it will shut down the entire
> > > > filesystem without trying any of the other mirrors. This is a severe
> > > > loss of service, and we propose these patches to have XFS try harder to
> > > > avoid failure.
> > > >
> > > > This patch set prototypes this mirror retry idea by:
> > > > * Adding @nr_mirrors to struct request_queue which is similar to
> > > > blk_queue_nonrot(); the filesystem can grab the device request queue and check
> > > > the max mirrors this block device has.
> > > > Helper functions were also added to get/set the nr_mirrors.
> > > >
> > > > * Expanding bi_write_hint to bi_rw_hint; now @bi_rw_hint has three meanings:
> > > > 1. Original write_hint.
> > > > 2. end_io() will update @bi_rw_hint to reflect which mirror this I/O really happened on.
> > > > 3. Fs sets @bi_rw_hint to force the driver (e.g. raid1) to read from a specific mirror.
> > > >
> > > > * Modify md/raid1 to support this retry feature.
> > > >
> > > > * Add b_rw_hint to xfs_buf
> > > > This patch adds a new field b_rw_hint to xfs_buf. We will use this to set the
> > > > new bio->bi_rw_hint when submitting the read request, and also to store the
> > > > returned mirror when the read completes.
> > >
> > > One thing that is going to make this more complex at the XFS layer
> > > is discontiguous buffers. They require multiple IOs (and therefore
> > > bios) and so we are going to need to ensure that all the bios use
> > > the same bi_rw_hint.
> >
> > Hmm, we hadn't thought about that. What happens if we have a
> > discontiguous buffer mapped to multiple blocks, and there's only one
> > good copy of each block on separate disks in the whole array?
> >
> > e.g. we have 8k directory blocks on a 4k block filesystem, only disk 0
> > has a good copy of block 0 and only disk 1 has a good copy of block 1?
>
> Then the user has a disaster on their hands because they have
> multiple failing disks.

Or lives in the crazy modern age, where we have rapidly autodegrading
flash storage and hard disks whose heads pop off with no warning. :D

(But seriously, ugh.)

> > I think we're just stuck with failing the whole thing because we can't
> > check the halves of the 8k block independently and there's too much of a
> > combinatoric explosion potential to try to mix and match.
>
> Yup, user needs to fix their storage before the filesystem can
> attempt recovery.
>
> > > > We're not planning to take over all 16 bits of the read hint field; just looking for
> > > > feedback about the sanity of the overall approach.
> > >
> > > It seems conceptually simple enough - the biggest questions I have
> > > are:
> > >
> > > - how does propagation through stacked layers work?
> >
> > Right now it doesn't, though once we work out how to make stacking work
> > through device mapper it could (my guess is that simple dm targets like
> > linear and crypt can set the mirror count to min(all underlying devices)).
> >
> > > - is it generic/abstract enough to be able to work with
> > > RAID5/6 to trigger verification/recovery from the parity
> > > information in the stripe?
> >
> > In theory we could supply a raid5 implementation, wherein rw_hint == 0
> > lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
> > rw_hint == 2 forces stripe recovery for the given block.
>
> So more magic numbers to define complex behaviours? :P

Yes!!!

I mean... you /could/ allow devices more expansive reporting of their
redundancy capabilities so that xfs could look at its read-retry-time
budget and try mirrors in decreasing order of likelihood of a good
response:

struct blkdev_redundancy_level {
	unsigned latency;		/* ms */
	unsigned chance_of_success;	/* 0 to 100 */
} redundancy_levels[blk_queue_get_mirrors()] = {
	{      10, 90 },	/* tries another mirror */
	{     300, 85 },	/* erasure decoding */
	{    7000, 30 },	/* long slow disk scraping via SCT ERC */
	{ 1000000,  5 },	/* boils the oceans looking for data */
};

So at least the indices wouldn't be *completely* magic. But now we have
the question of how you populate this table. And how many callers are
going to do something smarter than the dumb loop, enough to make the
extra code worth it?

(Anyone? Now would be a great time to pipe up.)
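
As a sketch of a caller smarter than the dumb loop, assuming the
hypothetical redundancy_levels[] table above (pick_next_mirror() is
invented for illustration):

	/* Choose the untried level with the best odds that still fits
	 * in the caller's remaining latency budget. */
	static int pick_next_mirror(const struct blkdev_redundancy_level *tbl,
				    int nr, const bool *tried,
				    unsigned budget_ms)
	{
		int i, best = -1;

		for (i = 0; i < nr; i++) {
			if (tried[i] || tbl[i].latency > budget_ms)
				continue;
			if (best < 0 || tbl[i].chance_of_success >
					tbl[best].chance_of_success)
				best = i;
		}
		return best;	/* -1: nothing left within budget */
	}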

> > A trickier scenario that I have no idea how to solve is the question of
> > how to handle dynamic redundancy levels. We don't have a standard bio
> > error value that means "this mirror is temporarily offline", so if you
>
> We can get ETIMEDOUT, ENOLINK, EBUSY and EAGAIN from the block layer
> which all indicate temporary errors (see blk_errors[]). Whether the
> specific storage layers are actually using them is another matter...

<nod>

> > have a raid1 of two disks and disk 0 goes offline, the retry loop in xfs
> > will hit the EIO and abort without even asking disk 1. It's also
> > unclear if we need to designate a second bio error value to mean "this
> > mirror is permanently gone".
>
> If we have mirror-based retries, we should probably consider EIO
> as "try next mirror", not as a hard failure.

Yeah.

> > [Also insert handwaving about whether or not online fsck will want to
> > control retries and automatic rewrite; I suspect the answer is that it
> > doesn't care.]
>
> Don't care - have the storage fix itself, then check what comes
> back and fix it from there.

<nod> Admittedly, the auto retry and rewrite are dependent solely on the
lack of EIO and the verifiers giving their blessing, and for the most
part online fsck doesn't go digging through buffers that don't pass the
verifiers, so it'll likely never see any of this anyway.

> > [[Also insert severe handwaving about do we expose this to userspace so
> > that xfs_repair can use it?]]
>
> I suspect the answer there is through the AIO interfaces....

Y{ay,uck}...

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2018-11-28 07:36:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v1 5/7] xfs: Add device retry

On Wed, Nov 28, 2018 at 04:08:50PM +1100, Dave Chinner wrote:
> So the first time through this loop the block layer decides what
> device to read from, then we iterate devices 1..n on error.
>
> Which means if device 0 is the only one with good information in it,
> we may not ever actually read from it.
>
> I'd suggest that a hint of "-1" (or equivalent max value) should be
> used for "device selects mirror leg" rather than 0, so we can
> actually read from the first device on command.

Yes. For one thing I think we really need to split this retry counter
of sorts from the write hints. I.e. make both u8 types and keep them
separate. Then start out with (u8)-1 as initialized by the block layer
for the first attempt. The device then fills out which leg it used
(in the completion path, so that another underlying driver doesn't
override it!), and then the file system just preserves this value on
a resubmit, leaving the driver to choose a new value when it gets a
non -1 value.
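
A sketch of that lifecycle; the helper names are invented, and only
the split-out u8 / (u8)-1 convention is from the suggestion above:

	#define READ_HINT_ANY	((u8)-1)	/* set at bio_init time */

	/* Driver submission: -1 means load-balance freely; anything
	 * else is the leg a previous attempt used, so step past it. */
	static int raid1_choose_leg(u8 hint, int nr_legs, int balanced_leg)
	{
		if (hint == READ_HINT_ANY)
			return balanced_leg;		/* first attempt */
		return (hint + 1) % nr_legs;		/* retry: new leg */
	}

	/* Driver completion: record the leg actually used here, not at
	 * submission, so a stacked driver below can't override it.  The
	 * filesystem preserves this value verbatim on a resubmit. */
	static void raid1_note_leg(u8 *hint, int leg)
	{
		*hint = leg;
	}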

2018-11-28 07:38:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> One thing that is going to make this more complex at the XFS layer
> is discontiguous buffers. They require multiple IOs (and therefore
> bios) and so we are going to need to ensure that all the bios use
> the same bi_rw_hint.

Well, in case of raid 1 the load balancing code might actually
map different bios to different initial legs. What we really need
is to keep the 'index' or each bio. One good way to archive that
is to just reuse the bio for the retry instead of allocating a new one.

2018-11-28 07:46:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> - how does propagation through stacked layers work?

The only way it works is by each layer driving it. Thus my
recommendation above, building on your earlier one, to use an index
that is filled by the driver at I/O completion time.

E.g.

bio_init: bi_leg = -1

raid1: submit bio to lower driver
raid 1 completion: set bi_leg to 0 or 1

Now if we want to allow stacking we need to save/restore bi_leg
before submitting to the underlying device. Which is possible,
but quite a bit of work in the drivers.
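
Roughly, per stacking layer (all names here are illustrative):

	struct stacked_ctx {		/* per-clone save area */
		struct bio *parent;	/* bio submitted to this layer */
		u8	    my_leg;	/* leg this layer chose */
		u8	    lower_leg;	/* filled in by the layer below */
	};

	static void stacked_end_io(struct bio *clone)
	{
		struct stacked_ctx *ctx = clone->bi_private;

		/* save what the lower device reported for the clone,
		 * then restore our own leg so the lower index doesn't
		 * leak up to the filesystem */
		ctx->lower_leg = clone_bi_leg(clone);	/* made-up accessor */
		set_bi_leg(ctx->parent, ctx->my_leg);	/* made-up accessor */
		bio_endio(ctx->parent);
	}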

> - is it generic/abstract enough to be able to work with
> RAID5/6 to trigger verification/recovery from the parity
> information in the stripe?

If we get a non -1 bi_leg for parity raid, this is an indicator
that parity rebuild needs to happen. For multi-parity setups we could
also use different levels there.

2018-11-28 07:47:24

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Tue, Nov 27, 2018 at 11:37:22PM -0800, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> > One thing that is going to make this more complex at the XFS layer
> > is discontiguous buffers. They require multiple IOs (and therefore
> > bios) and so we are going to need to ensure that all the bios use
> > the same bi_rw_hint.
>
> Well, in case of raid 1 the load balancing code might actually
> map different bios to different initial legs. What we really need
> is to keep the 'index' of each bio. One good way to achieve that
> is to just reuse the bio for the retry instead of allocating a new one.

Not sure that is practical, because by the time we run the verifier
that discovers the error we've already released and freed all the
bios. And we don't know when we complete the individual bios whether
to keep it or not, as the failure may occur in a bio that has not yet
completed.

Maybe we should be chaining bios for discontig buffers rather than
submitting them individually - that keeps the whole chain around
until all bios in the chain have completed, right?
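
Something like this, perhaps -- bio_chain() is the real interface,
xb_map_bio() is a made-up helper, and the chained children must not
set bi_end_io themselves:

	/* Chain each per-map bio to a single head so the buffer gets
	 * one completion for the whole I/O. */
	struct bio *head = xb_map_bio(bp, 0);	/* carries bi_end_io */
	int i;

	for (i = 1; i < bp->b_map_count; i++) {
		struct bio *bio = xb_map_bio(bp, i);

		bio_chain(bio, head);	/* head completes last */
		submit_bio(bio);
	}
	submit_bio(head);

(Though as the reply below notes, this only keeps the head bio alive,
not every bio in the chain.)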

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-11-28 07:52:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Wed, Nov 28, 2018 at 06:46:13PM +1100, Dave Chinner wrote:
> Maybe we should be chaining bios for discontig buffers rather than
> submitting them individually - that keeps the whole chain around
> until all bios in the chain have completed, right?

No, it doesn't. It just keeps the head of the chain around.

But we generally submit one bio per map; only if a map was bigger
than BIO_MAX_PAGES * PAGE_SIZE would we submit multiple bios.
That should always be bigger than our buffer sizes.

We also have the additional problem that a single bio submitted
by the file system can be split into multiple by the block layer,
which happens for raid 5 at least, but at least that splitting
is driven by the driver's make_request function, so it can do
smarts there.

2018-11-28 12:44:53

by Bob Liu

[permalink] [raw]
Subject: Re: [PATCH v1 5/7] xfs: Add device retry

On 11/28/18 3:35 PM, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 04:08:50PM +1100, Dave Chinner wrote:
>> So the first time through this loop the block layer decides what
>> device to read from, then we iterate devices 1..n on error.
>>
>> Which means if device 0 is the only one with good information in it,
>> we may not ever actually read from it.
>>
>> I'd suggest that a hint of "-1" (or equivalent max value) should be
>> used for "device selects mirror leg" rather than 0, so we can
>> actually read from the first device on command.
>
> Yes. For one thing I think we really need to split this retry counter
> of sorts from the write hints. I.e. make both u8 types and keep them
> separate. Then start out with (u8)-1 as initialized by the block layer
> for the first attempt. The device then fills out which leg it used
> (in the completion path, so that another underlying driver doesn't
> override it!), and then the file system just preserves this value on
> a resubmit, leaving the driver to choose a new value when it gets a
> non -1 value.
>

Will update as suggested, thank you for all your feedback :)

-Bob

2018-11-28 16:50:17

by Allison Henderson

[permalink] [raw]
Subject: Re: [PATCH v1 5/7] xfs: Add device retry



On 11/28/18 5:41 AM, Bob Liu wrote:
> On 11/28/18 3:35 PM, Christoph Hellwig wrote:
>> On Wed, Nov 28, 2018 at 04:08:50PM +1100, Dave Chinner wrote:
>>> So the first time through this loop the block layer decides what
>>> device to read from, then we iterate devices 1..n on error.
>>>
>>> Which means if device 0 is the only one with good information in it,
>>> we may not ever actually read from it.
>>>
>>> I'd suggest that a hint of "-1" (or equivalent max value) should be
>>> used for "device selects mirror leg" rather than 0, so we can
>>> actually read from the first device on command.
>>
>> Yes. For one thing I think we really need to split this retry counter
>> of sorts from the write hints. I.e. make both u8 types and keep them
>> separate. Then start out with (u8)-1 as initialized by the block layer
>> for the first attempt. The device then fills out which leg it used
>> (in the completion path, so that another underlying driver doesn't
>> override it!), and then the file system just preserves this value on
>> a resubmit, leaving the driver to choose a new value when it gets a
>> non -1 value.
>>
>
> Will update as suggested, thank you for all your feedback :)
>
> -Bob
>

Yes, thanks everyone for your feedback. Maybe Bob and I can come up
with some test cases that recreate the problem scenarios described here
and see if we can work out a solution to the multi-bio complexities.
Thanks!

Allison

2018-11-28 19:40:02

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Nov 27, 2018, at 10:49 PM, Darrick J. Wong <[email protected]> wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
>> On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
>>> Motivation:
>>> When fs data/metadata checksums mismatch, lower block devices may have other
>>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1
>>> but decides that the metadata is garbage, today it will shut down the entire
>>> filesystem without trying any of the other mirrors. This is a severe
>>> loss of service, and we propose these patches to have XFS try harder to
>>> avoid failure.
>>>
>>> This patch set prototypes this mirror retry idea by:
>>> * Adding @nr_mirrors to struct request_queue which is similar to
>>> blk_queue_nonrot(); the filesystem can grab the device request queue and check
>>> the max mirrors this block device has.
>>> Helper functions were also added to get/set the nr_mirrors.
>>>
>>> * Expanding bi_write_hint to bi_rw_hint; now @bi_rw_hint has three meanings:
>>> 1. Original write_hint.
>>> 2. end_io() will update @bi_rw_hint to reflect which mirror this I/O really happened on.
>>> 3. Fs sets @bi_rw_hint to force the driver (e.g. raid1) to read from a specific mirror.
>>>
>>> * Modify md/raid1 to support this retry feature.
>>>
>>> * Add b_rw_hint to xfs_buf
>>> This patch adds a new field b_rw_hint to xfs_buf. We will use this to set the
>>> new bio->bi_rw_hint when submitting the read request, and also to store the
>>> returned mirror when the read completes
>
>> the retry iterations. That allows us to let the block layer pick
>> whatever leg it wants for the initial read, but if we get a failure
>> we directly control the mirror we retry from and all bios in the
>> buffer go to that same mirror.
>> - is it generic/abstract enough to be able to work with
>> RAID5/6 to trigger verification/recovery from the parity
>> information in the stripe?
>
> In theory we could supply a raid5 implementation, wherein rw_hint == 0
> lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
> rw_hint == 2 forces stripe recovery for the given block.

Definitely this API needs to be useful for RAID-5/6 storage as well, and
I don't think that needs too complex an interface to achieve.

Basically, the "nr_mirrors" parameter would instead be "nr_retries" or
similar, so that the caller knows how many possible data combinations
there are to try and validate. For mirrors this is easy, and is how it
is currently implemented. For RAID-5/6 this would essentially be the
number of data rebuild combinations in the RAID group (e.g. 8 in a
RAID-5 8+1 setup, and 16 in a RAID-6 8+2).

For each call with nr_retries != 0, the MD RAID-5/6 driver would skip
one of the data drives, and rebuild that part of the data from parity.
This wouldn't take too long, since the blocks are already in memory,
they just need the parity to be recomputed in a few different ways to
try and find a combination that returns valid data (e.g. if a drive
failed and the parity also has a latent corrupt sector, not uncommon).

The next step is to have an API that says "retry=N returned the correct
data, rebuild the parity/drive with that combination of devices" so
that the corrupt parity sector isn't used during the rebuild.
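
A sketch of that retry-index convention (purely illustrative): index 0
stays "normal read", and index N means "rebuild as if data drive N-1
had failed":

	/* Map a retry index onto the data drive to exclude and rebuild
	 * from parity.  For RAID-5 8+1 the queue would advertise
	 * nr_retries = 8; RAID-6 8+2 extends the same idea to its 16
	 * exclusion combinations. */
	static int raid5_retry_to_skipped_drive(u8 retry, int data_disks)
	{
		if (retry < 1 || retry > data_disks)
			return -EINVAL;		/* outside advertised range */
		return retry - 1;		/* drive to pretend failed */
	}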

Cheers, Andreas


2018-12-08 14:53:25

by Bob Liu

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On 11/28/18 3:45 PM, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
>> - how does propagation through stacked layers work?
>
> The only way it works is by each layer driving it. Thus my
> recommendation above, building on your earlier one, to use an index
> that is filled by the driver at I/O completion time.
>
> E.g.
>
> bio_init: bi_leg = -1
>
> raid1: submit bio to lower driver
> raid 1 completion: set bi_leg to 0 or 1
>
> Now if we want to allow stacking we need to save/restore bi_leg
> before submitting to the underlying device. Which is possible,
> but quite a bit of work in the drivers.
>

I found it's still very challenging while writing the code.
Save/restore of bi_leg may not be enough, because the drivers don't know how to do fs-metadata verification.

E.g. two-layer raid1 stacking:

fs:                md0(copies:2)
                  /             \
layer1/raid1   md1(copies:2)   md2(copies:2)
                /     \           /     \
layer2/raid1  dev0   dev1      dev2    dev3

Assume dev2 is corrupted:
=> md2: doesn't know how to do fs-metadata verification.
=> md0: fs verify fails, retry md1 (preserving md2).
Then md2 will never be retried, even though dev3 may also have the right copy.
Unless the upper layer device (md0) can know the number of copies is 4 instead of 2?
And we'd need a way to handle the mapping.
Did I miss something? Thanks!

-Bob

>> - is it generic/abstract enough to be able to work with
>> RAID5/6 to trigger verification/recovery from the parity
>> information in the stripe?
>
> If we get a non -1 bi_leg for parity raid, this is an indicator
> that parity rebuild needs to happen. For multi-parity setups we could
> also use different levels there.
>


2018-12-10 04:31:59

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry

On Sat, Dec 08, 2018 at 10:49:44PM +0800, Bob Liu wrote:
> On 11/28/18 3:45 PM, Christoph Hellwig wrote:
> > On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> >> - how does propagation through stacked layers work?
> >
> > The only way it works is by each layer driving it. Thus my
> > recommendation above, building on your earlier one, to use an index
> > that is filled by the driver at I/O completion time.
> >
> > E.g.
> >
> > bio_init: bi_leg = -1
> >
> > raid1: submit bio to lower driver
> > raid 1 completion: set bi_leg to 0 or 1
> >
> > Now if we want to allow stacking we need to save/restore bi_leg
> > before submitting to the underlying device. Which is possible,
> > but quite a bit of work in the drivers.
> >
>
> I found it's still very challenging while writing the code.
> Save/restore of bi_leg may not be enough, because the drivers don't know how to do fs-metadata verification.
>
> E.g. two-layer raid1 stacking:
>
> fs:                md0(copies:2)
>                   /             \
> layer1/raid1   md1(copies:2)   md2(copies:2)
>                 /     \           /     \
> layer2/raid1  dev0   dev1      dev2    dev3
>
> Assume dev2 is corrupted:
> => md2: doesn't know how to do fs-metadata verification.
> => md0: fs verify fails, retry md1 (preserving md2).
> Then md2 will never be retried, even though dev3 may also have the right copy.
> Unless the upper layer device (md0) can know the number of copies is 4 instead of 2?
> And we'd need a way to handle the mapping.
> Did I miss something? Thanks!

<shrug> It seems reasonable to me that the raid1 layer should set the
number of retries to (number of raid1 mirrors) * min(retry count of all
mirrors) so that the upper layer device (md0) would advertise 4 retry
possibilities instead of 2.
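
A sketch of that arithmetic against Bob's diagram (names invented):
md0 would advertise copies(md0) * min(copies(md1), copies(md2)) =
2 * 2 = 4, and a flat hint picks one leg per layer:

	static void split_stacked_hint(u8 hint, int lower_copies,
				       u8 *top_leg, u8 *lower_hint)
	{
		*top_leg = hint / lower_copies;		/* md1 or md2 */
		*lower_hint = hint % lower_copies;	/* dev under it */
	}

	/* hint 3 -> top leg 1 (md2), lower hint 1 (dev3), so dev3 does
	 * get tried even after dev2 fails verification. */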

--D


> -Bob
>
> >> - is it generic/abstract enough to be able to work with
> >> RAID5/6 to trigger verification/recovery from the parity
> >> information in the stripe?
> >
> > If we get a non -1 bi_leg for parity raid, this is an indicator
> > that parity rebuild needs to happen. For multi-parity setups we could
> > also use different levels there.
> >
>