2010-08-12 12:44:43

by Tejun Heo

Subject: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

This patchset replaces the current barrier implementation with a
sequenced flush mechanism which doesn't impose any ordering
restriction around the flush requests.  This patchset is the result
of the following discussion thread.

http://thread.gmane.org/gmane.linux.file-systems/43877

In summary, filesystems can take over the ordering of requests around
commit writes, and the block layer should just supply a mechanism to
perform the commit writes themselves.  This would greatly lessen the
stall caused by the queue dumping and draining that the current
barrier implementation uses for request ordering.

This patchset converts the barrier mechanism to a sequenced flush/FUA
mechanism in the following steps.

1. Kill the mostly unused ORDERED_BY_TAG support.

2. Deprecate REQ_HARDBARRIER support. All hard barrier requests are
failed with -EOPNOTSUPP.

3. Drop barrier ordering by queue draining mechanism.

4. Rename barrier to flush and implement new interface based on
REQ_FLUSH and REQ_FUA as suggested by Christoph.

blkdev_issue_flush() is converted to use the new mechanism, but all
filesystems still use the deprecated REQ_HARDBARRIER, which now
always fails.  Each filesystem needs to be updated to enforce request
ordering itself and then to use the REQ_FLUSH/FUA mechanism.
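
For illustration, once the series is applied, a filesystem commit
write could look roughly like the sketch below.  This is not part of
the series; the function name is made up and only WRITE_FLUSH_FUA and
the buffer_head helpers are real interfaces.

#include <linux/buffer_head.h>
#include <linux/fs.h>

/*
 * Hypothetical sketch: the filesystem enforces ordering itself by
 * waiting for its own in-flight data writes (not shown), then issues
 * the commit record with an explicit cache flush and FUA instead of
 * a barrier.
 */
static int fs_write_commit_block(struct buffer_head *commit_bh)
{
        lock_buffer(commit_bh);
        get_bh(commit_bh);
        commit_bh->b_end_io = end_buffer_write_sync;
        /* flush cache, write commit block, force it to media */
        submit_bh(WRITE_FLUSH_FUA, commit_bh);
        wait_on_buffer(commit_bh);

        return buffer_uptodate(commit_bh) ? 0 : -EIO;
}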

loop, md, dm, etc... haven't been converted yet, so REQ_FLUSH/FUA
doesn't work with them.  I'll convert most of them soonish if this
patchset is generally agreed upon.

This patchset contains the following patches.

0001-block-loop-queue-ordered-mode-should-be-DRAIN_FLUSH.patch
0002-block-kill-QUEUE_ORDERED_BY_TAG.patch
0003-block-deprecate-barrier-and-replace-blk_queue_ordere.patch
0004-block-remove-spurious-uses-of-REQ_HARDBARRIER.patch
0005-block-misc-cleanups-in-barrier-code.patch
0006-block-drop-barrier-ordering-by-queue-draining.patch
0007-block-rename-blk-barrier.c-to-blk-flush.c.patch
0008-block-rename-barrier-ordered-to-flush.patch
0009-block-implement-REQ_FLUSH-FUA-based-interface-for-FL.patch
0010-fs-block-propagate-REQ_FLUSH-FUA-interface-to-upper-.patch
0011-block-use-REQ_FLUSH-in-blkdev_issue_flush.patch

and is also available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua

and contains the following changes.

block/Makefile | 2
block/blk-barrier.c | 350 ------------------------------------
block/blk-core.c | 55 ++---
block/blk-flush.c | 248 +++++++++++++++++++++++++
block/blk-settings.c | 20 ++
block/blk.h | 8
block/elevator.c | 79 --------
drivers/block/brd.c | 1
drivers/block/loop.c | 2
drivers/block/osdblk.c | 5
drivers/block/pktcdvd.c | 1
drivers/block/ps3disk.c | 2
drivers/block/virtio_blk.c | 34 ---
drivers/block/xen-blkfront.c | 47 +---
drivers/ide/ide-disk.c | 13 -
drivers/md/dm.c | 2
drivers/mmc/card/queue.c | 1
drivers/s390/block/dasd.c | 1
drivers/scsi/aic7xxx_old.c | 21 --
drivers/scsi/libsas/sas_scsi_host.c | 13 -
drivers/scsi/sd.c | 18 -
fs/buffer.c | 27 +-
include/linux/blk_types.h | 4
include/linux/blkdev.h | 73 +------
include/linux/buffer_head.h | 8
include/linux/fs.h | 20 +-
include/scsi/scsi_tcq.h | 6
27 files changed, 402 insertions(+), 659 deletions(-)

Thanks.

--
tejun


2010-08-12 12:44:34

by Tejun Heo

Subject: [PATCH 06/11] block: drop barrier ordering by queue draining

Filesystems will take full responsibility for ordering requests
around commit writes and will only indicate how the commit writes
themselves should be handled by the block layer.  This patch drops
barrier ordering by queue draining from the block layer.  The
ordering-by-draining implementation was somewhat invasive to request
handling.  A list of notable changes follows.

* Each queue had a one-bit color which was flipped on each barrier
issue and was used to track whether a given request was issued before
the current barrier.  The REQ_ORDERED_COLOR flag and the coloring
implementation in __elv_add_request() are removed.

* Requests which shouldn't be processed yet for draining were stalled
by returning -EAGAIN from blk_do_ordered() based on a comparison
between blk_ordered_req_seq() and blk_ordered_cur_seq().  This logic
is removed.

* The draining completion logic in elv_completed_request() is removed.

* All barrier sequence requests were queued to the request queue and
then trickled to the lower layer according to progress, so
maintaining request order during requeue was necessary.  This is
replaced by queueing the next request in the barrier sequence from
blk_ordered_complete_seq() only after the current one completes,
which removes the need for multiple proxy requests in struct
request_queue and the request sorting logic in the
ELEVATOR_INSERT_REQUEUE path of elv_insert() (see the stand-alone
sketch after this list).

* As barriers no longer have ordering constraints, there's no need to
dump the whole elevator onto the dispatch queue on each barrier.
Insert barriers at the front instead.

* If other barrier requests reach the front of the dispatch queue
while one is already in progress, they are stored on
q->pending_barriers and restored to the dispatch queue one by one
after each barrier completion from blk_ordered_complete_seq().
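
To make the resulting behavior concrete, the stand-alone model below
(plain userspace C, not kernel code) mimics how the new sequencing
walks the stages: skipped stages are marked complete up front and the
next stage is issued only after the previous one finishes, with no
queue draining involved.  The bit names mirror the QUEUE_ORDSEQ_*
flags; everything else is illustrative only.

#include <stdio.h>

enum {
        SEQ_STARTED   = 1 << 0,
        SEQ_PREFLUSH  = 1 << 1,
        SEQ_BAR       = 1 << 2,
        SEQ_POSTFLUSH = 1 << 3,
        SEQ_DONE      = 1 << 4,
};

/* lowest bit not yet set == the next stage to issue (cf. ffz()) */
static unsigned cur_stage(unsigned seq)
{
        unsigned bit;

        for (bit = 1; seq & bit; bit <<= 1)
                ;
        return bit;
}

int main(void)
{
        /* empty barrier without FUA: BAR and POSTFLUSH are skipped,
         * so they are marked complete up front and only PREFLUSH
         * remains to be issued */
        unsigned seq = SEQ_STARTED | SEQ_BAR | SEQ_POSTFLUSH;

        while (cur_stage(seq) != SEQ_DONE) {
                unsigned stage = cur_stage(seq);

                printf("issue stage 0x%x\n", stage);
                seq |= stage;   /* the stage's end_io marks it done */
        }
        printf("sequence done, complete the original request\n");
        return 0;
}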

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
block/blk-barrier.c | 220 ++++++++++++++++++---------------------------
block/blk-core.c | 11 ++-
block/blk.h | 2 +-
block/elevator.c | 79 ++--------------
include/linux/blk_types.h | 2 -
include/linux/blkdev.h | 19 ++---
6 files changed, 113 insertions(+), 220 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f1be85b..e8b2e5c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -9,6 +9,8 @@

#include "blk.h"

+static struct request *queue_next_ordseq(struct request_queue *q);
+
/*
* Cache flushing for ordered writes handling
*/
@@ -19,38 +21,10 @@ unsigned blk_ordered_cur_seq(struct request_queue *q)
return 1 << ffz(q->ordseq);
}

-unsigned blk_ordered_req_seq(struct request *rq)
-{
- struct request_queue *q = rq->q;
-
- BUG_ON(q->ordseq == 0);
-
- if (rq == &q->pre_flush_rq)
- return QUEUE_ORDSEQ_PREFLUSH;
- if (rq == &q->bar_rq)
- return QUEUE_ORDSEQ_BAR;
- if (rq == &q->post_flush_rq)
- return QUEUE_ORDSEQ_POSTFLUSH;
-
- /*
- * !fs requests don't need to follow barrier ordering. Always
- * put them at the front. This fixes the following deadlock.
- *
- * http://thread.gmane.org/gmane.linux.kernel/537473
- */
- if (rq->cmd_type != REQ_TYPE_FS)
- return QUEUE_ORDSEQ_DRAIN;
-
- if ((rq->cmd_flags & REQ_ORDERED_COLOR) ==
- (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR))
- return QUEUE_ORDSEQ_DRAIN;
- else
- return QUEUE_ORDSEQ_DONE;
-}
-
-bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+ unsigned seq, int error)
{
- struct request *rq;
+ struct request *next_rq = NULL;

if (error && !q->orderr)
q->orderr = error;
@@ -58,16 +32,22 @@ bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
BUG_ON(q->ordseq & seq);
q->ordseq |= seq;

- if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE)
- return false;
-
- /*
- * Okay, sequence complete.
- */
- q->ordseq = 0;
- rq = q->orig_bar_rq;
- __blk_end_request_all(rq, q->orderr);
- return true;
+ if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+ /* not complete yet, queue the next ordered sequence */
+ next_rq = queue_next_ordseq(q);
+ } else {
+ /* complete this barrier request */
+ __blk_end_request_all(q->orig_bar_rq, q->orderr);
+ q->orig_bar_rq = NULL;
+ q->ordseq = 0;
+
+ /* dispatch the next barrier if there's one */
+ if (!list_empty(&q->pending_barriers)) {
+ next_rq = list_entry_rq(q->pending_barriers.next);
+ list_move(&next_rq->queuelist, &q->queue_head);
+ }
+ }
+ return next_rq;
}

static void pre_flush_end_io(struct request *rq, int error)
@@ -88,133 +68,105 @@ static void post_flush_end_io(struct request *rq, int error)
blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
}

-static void queue_flush(struct request_queue *q, unsigned which)
+static void queue_flush(struct request_queue *q, struct request *rq,
+ rq_end_io_fn *end_io)
{
- struct request *rq;
- rq_end_io_fn *end_io;
-
- if (which == QUEUE_ORDERED_DO_PREFLUSH) {
- rq = &q->pre_flush_rq;
- end_io = pre_flush_end_io;
- } else {
- rq = &q->post_flush_rq;
- end_io = post_flush_end_io;
- }
-
blk_rq_init(q, rq);
rq->cmd_type = REQ_TYPE_FS;
- rq->cmd_flags = REQ_HARDBARRIER | REQ_FLUSH;
+ rq->cmd_flags = REQ_FLUSH;
rq->rq_disk = q->orig_bar_rq->rq_disk;
rq->end_io = end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
}

-static inline struct request *start_ordered(struct request_queue *q,
- struct request *rq)
+static struct request *queue_next_ordseq(struct request_queue *q)
{
- unsigned skip = 0;
-
- q->orderr = 0;
- q->ordered = q->next_ordered;
- q->ordseq |= QUEUE_ORDSEQ_STARTED;
-
- /*
- * For an empty barrier, there's no actual BAR request, which
- * in turn makes POSTFLUSH unnecessary. Mask them off.
- */
- if (!blk_rq_sectors(rq))
- q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
- QUEUE_ORDERED_DO_POSTFLUSH);
-
- /* stash away the original request */
- blk_dequeue_request(rq);
- q->orig_bar_rq = rq;
- rq = NULL;
-
- /*
- * Queue ordered sequence. As we stack them at the head, we
- * need to queue in reverse order. Note that we rely on that
- * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs
- * request gets inbetween ordered sequence.
- */
- if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) {
- queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH);
- rq = &q->post_flush_rq;
- } else
- skip |= QUEUE_ORDSEQ_POSTFLUSH;
+ struct request *rq = &q->bar_rq;

- if (q->ordered & QUEUE_ORDERED_DO_BAR) {
- rq = &q->bar_rq;
+ switch (blk_ordered_cur_seq(q)) {
+ case QUEUE_ORDSEQ_PREFLUSH:
+ queue_flush(q, rq, pre_flush_end_io);
+ break;

+ case QUEUE_ORDSEQ_BAR:
/* initialize proxy request and queue it */
blk_rq_init(q, rq);
init_request_from_bio(rq, q->orig_bar_rq->bio);
+ rq->cmd_flags &= ~REQ_HARDBARRIER;
if (q->ordered & QUEUE_ORDERED_DO_FUA)
rq->cmd_flags |= REQ_FUA;
rq->end_io = bar_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
- } else
- skip |= QUEUE_ORDSEQ_BAR;
+ break;

- if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) {
- queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH);
- rq = &q->pre_flush_rq;
- } else
- skip |= QUEUE_ORDSEQ_PREFLUSH;
+ case QUEUE_ORDSEQ_POSTFLUSH:
+ queue_flush(q, rq, post_flush_end_io);
+ break;

- if (queue_in_flight(q))
- rq = NULL;
- else
- skip |= QUEUE_ORDSEQ_DRAIN;
-
- /*
- * Complete skipped sequences. If whole sequence is complete,
- * return %NULL to tell elevator that this request is gone.
- */
- if (blk_ordered_complete_seq(q, skip, 0))
- rq = NULL;
+ default:
+ BUG();
+ }
return rq;
}

struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
{
- const int is_barrier = rq->cmd_type == REQ_TYPE_FS &&
- (rq->cmd_flags & REQ_HARDBARRIER);
-
- if (!q->ordseq) {
- if (!is_barrier)
- return rq;
-
- if (q->next_ordered != QUEUE_ORDERED_NONE)
- return start_ordered(q, rq);
- else {
- /*
- * Queue ordering not supported. Terminate
- * with prejudice.
- */
- blk_dequeue_request(rq);
- __blk_end_request_all(rq, -EOPNOTSUPP);
- return NULL;
- }
+ unsigned skip = 0;
+
+ if (!(rq->cmd_flags & REQ_HARDBARRIER))
+ return rq;
+
+ if (q->ordseq) {
+ /*
+ * Barrier is already in progress and they can't be
+ * processed in parallel. Queue for later processing.
+ */
+ list_move_tail(&rq->queuelist, &q->pending_barriers);
+ return NULL;
+ }
+
+ if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
+ /*
+ * Queue ordering not supported. Terminate
+ * with prejudice.
+ */
+ blk_dequeue_request(rq);
+ __blk_end_request_all(rq, -EOPNOTSUPP);
+ return NULL;
}

/*
- * Ordered sequence in progress
+ * Start a new ordered sequence
*/
+ q->orderr = 0;
+ q->ordered = q->next_ordered;
+ q->ordseq |= QUEUE_ORDSEQ_STARTED;

- /* Special requests are not subject to ordering rules. */
- if (rq->cmd_type != REQ_TYPE_FS &&
- rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
- return rq;
+ /*
+ * For an empty barrier, there's no actual BAR request, which
+ * in turn makes POSTFLUSH unnecessary. Mask them off.
+ */
+ if (!blk_rq_sectors(rq))
+ q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
+ QUEUE_ORDERED_DO_POSTFLUSH);

- /* Ordered by draining. Wait for turn. */
- WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
- if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
- rq = ERR_PTR(-EAGAIN);
+ /* stash away the original request */
+ blk_dequeue_request(rq);
+ q->orig_bar_rq = rq;

- return rq;
+ if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+ skip |= QUEUE_ORDSEQ_PREFLUSH;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+ skip |= QUEUE_ORDSEQ_BAR;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+ skip |= QUEUE_ORDSEQ_POSTFLUSH;
+
+ /* complete skipped sequences and return the first sequence */
+ return blk_ordered_complete_seq(q, skip, 0);
}

static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk-core.c b/block/blk-core.c
index ed8ef89..82bd6d9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
init_timer(&q->unplug_timer);
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
INIT_LIST_HEAD(&q->timeout_list);
+ INIT_LIST_HEAD(&q->pending_barriers);
INIT_WORK(&q->unplug_work, blk_unplug_work);

kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1185,6 +1186,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
const bool sync = (bio->bi_rw & REQ_SYNC);
const bool unplug = (bio->bi_rw & REQ_UNPLUG);
const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
+ int where = ELEVATOR_INSERT_SORT;
int rw_flags;

/* REQ_HARDBARRIER is no more */
@@ -1203,7 +1205,12 @@ static int __make_request(struct request_queue *q, struct bio *bio)

spin_lock_irq(q->queue_lock);

- if (unlikely((bio->bi_rw & REQ_HARDBARRIER)) || elv_queue_empty(q))
+ if (bio->bi_rw & REQ_HARDBARRIER) {
+ where = ELEVATOR_INSERT_FRONT;
+ goto get_rq;
+ }
+
+ if (elv_queue_empty(q))
goto get_rq;

el_ret = elv_merge(q, &req, bio);
@@ -1303,7 +1310,7 @@ get_rq:

/* insert the request into the elevator */
drive_stat_acct(req, 1);
- __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
+ __elv_add_request(q, req, where, 0);
out:
if (unplug || !queue_should_plug(q))
__generic_unplug_device(q);
diff --git a/block/blk.h b/block/blk.h
index 874eb4e..08081e4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -62,7 +62,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
rq = list_entry_rq(q->queue_head.next);
rq = blk_do_ordered(q, rq);
if (rq)
- return !IS_ERR(rq) ? rq : NULL;
+ return rq;
}

if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/block/elevator.c b/block/elevator.c
index 816a7c8..22c46b5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -611,8 +611,6 @@ void elv_quiesce_end(struct request_queue *q)

void elv_insert(struct request_queue *q, struct request *rq, int where)
{
- struct list_head *pos;
- unsigned ordseq;
int unplug_it = 1;

trace_block_rq_insert(q, rq);
@@ -620,9 +618,16 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
rq->q = q;

switch (where) {
+ case ELEVATOR_INSERT_REQUEUE:
+ /*
+ * Most requeues happen because of a busy condition,
+ * don't force unplug of the queue for that case.
+ * Clear unplug_it and fall through.
+ */
+ unplug_it = 0;
+
case ELEVATOR_INSERT_FRONT:
rq->cmd_flags |= REQ_SOFTBARRIER;
-
list_add(&rq->queuelist, &q->queue_head);
break;

@@ -662,36 +667,6 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
q->elevator->ops->elevator_add_req_fn(q, rq);
break;

- case ELEVATOR_INSERT_REQUEUE:
- /*
- * If ordered flush isn't in progress, we do front
- * insertion; otherwise, requests should be requeued
- * in ordseq order.
- */
- rq->cmd_flags |= REQ_SOFTBARRIER;
-
- /*
- * Most requeues happen because of a busy condition,
- * don't force unplug of the queue for that case.
- */
- unplug_it = 0;
-
- if (q->ordseq == 0) {
- list_add(&rq->queuelist, &q->queue_head);
- break;
- }
-
- ordseq = blk_ordered_req_seq(rq);
-
- list_for_each(pos, &q->queue_head) {
- struct request *pos_rq = list_entry_rq(pos);
- if (ordseq <= blk_ordered_req_seq(pos_rq))
- break;
- }
-
- list_add_tail(&rq->queuelist, pos);
- break;
-
default:
printk(KERN_ERR "%s: bad insertion point %d\n",
__func__, where);
@@ -710,26 +685,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
void __elv_add_request(struct request_queue *q, struct request *rq, int where,
int plug)
{
- if (q->ordcolor)
- rq->cmd_flags |= REQ_ORDERED_COLOR;
-
if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
- /*
- * toggle ordered color
- */
- if (rq->cmd_flags & REQ_HARDBARRIER)
- q->ordcolor ^= 1;
-
- /*
- * barriers implicitly indicate back insertion
- */
- if (where == ELEVATOR_INSERT_SORT)
- where = ELEVATOR_INSERT_BACK;
-
- /*
- * this request is scheduling boundary, update
- * end_sector
- */
+ /* barriers are scheduling boundary, update end_sector */
if (rq->cmd_type == REQ_TYPE_FS ||
(rq->cmd_flags & REQ_DISCARD)) {
q->end_sector = rq_end_sector(rq);
@@ -849,24 +806,6 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
e->ops->elevator_completed_req_fn)
e->ops->elevator_completed_req_fn(q, rq);
}
-
- /*
- * Check if the queue is waiting for fs requests to be
- * drained for flush sequence.
- */
- if (unlikely(q->ordseq)) {
- struct request *next = NULL;
-
- if (!list_empty(&q->queue_head))
- next = list_entry_rq(q->queue_head.next);
-
- if (!queue_in_flight(q) &&
- blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN &&
- (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) {
- blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0);
- __blk_run_queue(q);
- }
- }
}

#define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1185237..8e9887d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -141,7 +141,6 @@ enum rq_flag_bits {
__REQ_FAILED, /* set if the request failed */
__REQ_QUIET, /* don't worry about errors */
__REQ_PREEMPT, /* set for "ide_preempt" requests */
- __REQ_ORDERED_COLOR, /* is before or after barrier */
__REQ_ALLOCED, /* request came from our alloc pool */
__REQ_COPY_USER, /* contains copies of user pages */
__REQ_INTEGRITY, /* integrity metadata has been remapped */
@@ -181,7 +180,6 @@ enum rq_flag_bits {
#define REQ_FAILED (1 << __REQ_FAILED)
#define REQ_QUIET (1 << __REQ_QUIET)
#define REQ_PREEMPT (1 << __REQ_PREEMPT)
-#define REQ_ORDERED_COLOR (1 << __REQ_ORDERED_COLOR)
#define REQ_ALLOCED (1 << __REQ_ALLOCED)
#define REQ_COPY_USER (1 << __REQ_COPY_USER)
#define REQ_INTEGRITY (1 << __REQ_INTEGRITY)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 21baa19..522ecda 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -360,9 +360,10 @@ struct request_queue
unsigned int flush_flags;

unsigned int ordered, next_ordered, ordseq;
- int orderr, ordcolor;
- struct request pre_flush_rq, bar_rq, post_flush_rq;
+ int orderr;
+ struct request bar_rq;
struct request *orig_bar_rq;
+ struct list_head pending_barriers;

struct mutex sysfs_lock;

@@ -490,12 +491,11 @@ enum {
/*
* Ordered operation sequence
*/
- QUEUE_ORDSEQ_STARTED = 0x01, /* flushing in progress */
- QUEUE_ORDSEQ_DRAIN = 0x02, /* waiting for the queue to be drained */
- QUEUE_ORDSEQ_PREFLUSH = 0x04, /* pre-flushing in progress */
- QUEUE_ORDSEQ_BAR = 0x08, /* original barrier req in progress */
- QUEUE_ORDSEQ_POSTFLUSH = 0x10, /* post-flushing in progress */
- QUEUE_ORDSEQ_DONE = 0x20,
+ QUEUE_ORDSEQ_STARTED = (1 << 0), /* flushing in progress */
+ QUEUE_ORDSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
+ QUEUE_ORDSEQ_BAR = (1 << 2), /* barrier write in progress */
+ QUEUE_ORDSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
+ QUEUE_ORDSEQ_DONE = (1 << 4),
};

#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
@@ -867,9 +867,6 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern unsigned blk_ordered_cur_seq(struct request_queue *);
-extern unsigned blk_ordered_req_seq(struct request *);
-extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);

extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
extern void blk_dump_rq_flags(struct request *, char *);
--
1.7.1

2010-08-12 12:44:39

by Tejun Heo

Subject: [PATCH 11/11] block: use REQ_FLUSH in blkdev_issue_flush()

Update blkdev_issue_flush() to use the new REQ_FLUSH interface.
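
From a caller's point of view nothing changes; a typical user keeps
calling blkdev_issue_flush() as before.  A sketch (assumed caller,
not part of this patch; the BLKDEV_IFL_WAIT flag name is an
assumption based on this kernel generation's flag bits):

#include <linux/blkdev.h>

/* hypothetical fsync-path helper: issue an empty REQ_FLUSH bio to
 * flush the device's write cache and wait for its completion */
static int flush_device_cache(struct block_device *bdev)
{
        return blkdev_issue_flush(bdev, GFP_KERNEL, NULL,
                                  BLKDEV_IFL_WAIT);
}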

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
block/blk-flush.c | 17 ++++++-----------
1 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 452c552..ab765c2 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -174,13 +174,10 @@ struct request *blk_do_flush(struct request_queue *q, struct request *rq)
return blk_flush_complete_seq(q, skip, 0);
}

-static void bio_end_empty_barrier(struct bio *bio, int err)
+static void bio_end_flush(struct bio *bio, int err)
{
- if (err) {
- if (err == -EOPNOTSUPP)
- set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+ if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
- }
if (bio->bi_private)
complete(bio->bi_private);
bio_put(bio);
@@ -218,19 +215,19 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
* some block devices may not have their queue correctly set up here
* (e.g. loop device without a backing file) and so issuing a flush
* here will panic. Ensure there is a request function before issuing
- * the barrier.
+ * the flush.
*/
if (!q->make_request_fn)
return -ENXIO;

bio = bio_alloc(gfp_mask, 0);
- bio->bi_end_io = bio_end_empty_barrier;
+ bio->bi_end_io = bio_end_flush;
bio->bi_bdev = bdev;
if (test_bit(BLKDEV_WAIT, &flags))
bio->bi_private = &wait;

bio_get(bio);
- submit_bio(WRITE_BARRIER, bio);
+ submit_bio(WRITE_FLUSH, bio);
if (test_bit(BLKDEV_WAIT, &flags)) {
wait_for_completion(&wait);
/*
@@ -242,9 +239,7 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
*error_sector = bio->bi_sector;
}

- if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
- else if (!bio_flagged(bio, BIO_UPTODATE))
+ if (!bio_flagged(bio, BIO_UPTODATE))
ret = -EIO;

bio_put(bio);
--
1.7.1

2010-08-12 12:44:53

by Tejun Heo

Subject: [PATCH 04/11] block: remove spurious uses of REQ_HARDBARRIER

REQ_HARDBARRIER is deprecated.  Remove spurious uses in the following
users.  Note that, other than in osdblk, all of these uses were
already spurious even before the deprecation.

* osdblk: osdblk_rq_fn() won't receive any request with
REQ_HARDBARRIER set. Remove the test for it.

* pktcdvd: use of REQ_HARDBARRIER in pkt_generic_packet() doesn't mean
anything. Removed.

* aic7xxx_old: Setting MSG_ORDERED_Q_TAG on REQ_HARDBARRIER is
spurious. Removed.

* sas_scsi_host: Setting TASK_ATTR_ORDERED on REQ_HARDBARRIER is
spurious. Removed.

* scsi_tcq: The ordered tag path wasn't being used anyway. Removed.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Peter Osterlund <[email protected]>
---
drivers/block/osdblk.c | 3 +--
drivers/block/pktcdvd.c | 1 -
drivers/scsi/aic7xxx_old.c | 21 ++-------------------
drivers/scsi/libsas/sas_scsi_host.c | 13 +------------
include/scsi/scsi_tcq.h | 6 +-----
5 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 72d6246..87311eb 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -310,8 +310,7 @@ static void osdblk_rq_fn(struct request_queue *q)
break;

/* filter out block requests we don't understand */
- if (rq->cmd_type != REQ_TYPE_FS &&
- !(rq->cmd_flags & REQ_HARDBARRIER)) {
+ if (rq->cmd_type != REQ_TYPE_FS) {
blk_end_request_all(rq, 0);
continue;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index b1cbeb5..0166ea1 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -753,7 +753,6 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *

rq->timeout = 60*HZ;
rq->cmd_type = REQ_TYPE_BLOCK_PC;
- rq->cmd_flags |= REQ_HARDBARRIER;
if (cgc->quiet)
rq->cmd_flags |= REQ_QUIET;

diff --git a/drivers/scsi/aic7xxx_old.c b/drivers/scsi/aic7xxx_old.c
index 93984c9..e1cd606 100644
--- a/drivers/scsi/aic7xxx_old.c
+++ b/drivers/scsi/aic7xxx_old.c
@@ -2850,12 +2850,6 @@ aic7xxx_done(struct aic7xxx_host *p, struct aic7xxx_scb *scb)
aic_dev->r_total++;
ptr = aic_dev->r_bins;
}
- if(cmd->device->simple_tags && cmd->request->cmd_flags & REQ_HARDBARRIER)
- {
- aic_dev->barrier_total++;
- if(scb->tag_action == MSG_ORDERED_Q_TAG)
- aic_dev->ordered_total++;
- }
x = scb->sg_length;
x >>= 10;
for(i=0; i<6; i++)
@@ -10144,19 +10138,8 @@ static void aic7xxx_buildscb(struct aic7xxx_host *p, struct scsi_cmnd *cmd,
/* We always force TEST_UNIT_READY to untagged */
if (cmd->cmnd[0] != TEST_UNIT_READY && sdptr->simple_tags)
{
- if (req->cmd_flags & REQ_HARDBARRIER)
- {
- if(sdptr->ordered_tags)
- {
- hscb->control |= MSG_ORDERED_Q_TAG;
- scb->tag_action = MSG_ORDERED_Q_TAG;
- }
- }
- else
- {
- hscb->control |= MSG_SIMPLE_Q_TAG;
- scb->tag_action = MSG_SIMPLE_Q_TAG;
- }
+ hscb->control |= MSG_SIMPLE_Q_TAG;
+ scb->tag_action = MSG_SIMPLE_Q_TAG;
}
}
if ( !(aic_dev->dtr_pending) &&
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f0cfba9..535085c 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -130,17 +130,6 @@ static void sas_scsi_task_done(struct sas_task *task)
sc->scsi_done(sc);
}

-static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
-{
- enum task_attribute ta = TASK_ATTR_SIMPLE;
- if (cmd->request && blk_rq_tagged(cmd->request)) {
- if (cmd->device->ordered_tags &&
- (cmd->request->cmd_flags & REQ_HARDBARRIER))
- ta = TASK_ATTR_ORDERED;
- }
- return ta;
-}
-
static struct sas_task *sas_create_task(struct scsi_cmnd *cmd,
struct domain_device *dev,
gfp_t gfp_flags)
@@ -160,7 +149,7 @@ static struct sas_task *sas_create_task(struct scsi_cmnd *cmd,
task->ssp_task.retry_count = 1;
int_to_scsilun(cmd->device->lun, &lun);
memcpy(task->ssp_task.LUN, &lun.scsi_lun, 8);
- task->ssp_task.task_attr = sas_scsi_get_task_attr(cmd);
+ task->ssp_task.task_attr = TASK_ATTR_SIMPLE;
memcpy(task->ssp_task.cdb, cmd->cmnd, 16);

task->scatter = scsi_sglist(cmd);
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index 1723138..d6e7994 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -97,13 +97,9 @@ static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
{
struct request *req = cmd->request;
- struct scsi_device *sdev = cmd->device;

if (blk_rq_tagged(req)) {
- if (sdev->ordered_tags && req->cmd_flags & REQ_HARDBARRIER)
- *msg++ = MSG_ORDERED_TAG;
- else
- *msg++ = MSG_SIMPLE_TAG;
+ *msg++ = MSG_SIMPLE_TAG;
*msg++ = req->tag;
return 2;
}
--
1.7.1

2010-08-12 12:45:12

by Tejun Heo

Subject: [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()

Barriers are deemed too heavy and will soon be replaced by FLUSH/FUA
requests.  Deprecate barriers: all REQ_HARDBARRIERs are failed with
-EOPNOTSUPP and blk_queue_ordered() is replaced with the simpler
blk_queue_flush().

blk_queue_flush() takes a combination of REQ_FLUSH and REQ_FUA.  If a
device has a write cache and can flush it, it should set REQ_FLUSH.
If the device can also handle FUA writes, it should set REQ_FUA as
well.

All blk_queue_ordered() users are converted.

* ORDERED_DRAIN is mapped to 0 which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.
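
As an illustration (hypothetical driver, not part of this patch), a
conversion boils down to advertising the flag combination that
matches the device's cache behaviour:

#include <linux/blkdev.h>

/* hypothetical setup helper for a driver whose device has a volatile
 * write cache; has_fua says whether the device also supports FUA
 * writes.  A write-through device would pass 0 (the default). */
static void mydrv_setup_flush(struct request_queue *q, bool has_fua)
{
        unsigned int flush = REQ_FLUSH;

        if (has_fua)
                flush |= REQ_FUA;

        blk_queue_flush(q, flush);
}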

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Chris Wright <[email protected]>
Cc: FUJITA Tomonori <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Alasdair G Kergon <[email protected]>
Cc: Pierre Ossman <[email protected]>
Cc: Stefan Weinhuber <[email protected]>
---
block/blk-barrier.c | 29 ----------------------------
block/blk-core.c | 6 +++-
block/blk-settings.c | 20 +++++++++++++++++++
drivers/block/brd.c | 1 -
drivers/block/loop.c | 2 +-
drivers/block/osdblk.c | 2 +-
drivers/block/ps3disk.c | 2 +-
drivers/block/virtio_blk.c | 25 ++++++++---------------
drivers/block/xen-blkfront.c | 43 +++++++++++------------------------------
drivers/ide/ide-disk.c | 13 +++++------
drivers/md/dm.c | 2 +-
drivers/mmc/card/queue.c | 1 -
drivers/s390/block/dasd.c | 1 -
drivers/scsi/sd.c | 16 +++++++-------
include/linux/blkdev.h | 6 +++-
15 files changed, 67 insertions(+), 102 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index c807e9c..ed0aba5 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -9,35 +9,6 @@

#include "blk.h"

-/**
- * blk_queue_ordered - does this queue support ordered writes
- * @q: the request queue
- * @ordered: one of QUEUE_ORDERED_*
- *
- * Description:
- * For journalled file systems, doing ordered writes on a commit
- * block instead of explicitly doing wait_on_buffer (which is bad
- * for performance) can be a big win. Block drivers supporting this
- * feature should call this function and indicate so.
- *
- **/
-int blk_queue_ordered(struct request_queue *q, unsigned ordered)
-{
- if (ordered != QUEUE_ORDERED_NONE &&
- ordered != QUEUE_ORDERED_DRAIN &&
- ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
- ordered != QUEUE_ORDERED_DRAIN_FUA) {
- printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
- return -EINVAL;
- }
-
- q->ordered = ordered;
- q->next_ordered = ordered;
-
- return 0;
-}
-EXPORT_SYMBOL(blk_queue_ordered);
-
/*
* Cache flushing for ordered writes handling
*/
diff --git a/block/blk-core.c b/block/blk-core.c
index 5ab3ac2..3f802dd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1203,11 +1203,13 @@ static int __make_request(struct request_queue *q, struct bio *bio)
const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
int rw_flags;

- if ((bio->bi_rw & REQ_HARDBARRIER) &&
- (q->next_ordered == QUEUE_ORDERED_NONE)) {
+ /* REQ_HARDBARRIER is no more */
+ if (WARN_ONCE(bio->bi_rw & REQ_HARDBARRIER,
+ "block: HARDBARRIER is deprecated, use FLUSH/FUA instead\n")) {
bio_endio(bio, -EOPNOTSUPP);
return 0;
}
+
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a234f4b..9b18afc 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -794,6 +794,26 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
}
EXPORT_SYMBOL(blk_queue_update_dma_alignment);

+/**
+ * blk_queue_flush - configure queue's cache flush capability
+ * @q: the request queue for the device
+ * @flush: 0, REQ_FLUSH or REQ_FLUSH | REQ_FUA
+ *
+ * Tell block layer cache flush capability of @q. If it supports
+ * flushing, REQ_FLUSH should be set. If it supports bypassing
+ * write cache for individual writes, REQ_FUA should be set.
+ */
+void blk_queue_flush(struct request_queue *q, unsigned int flush)
+{
+ WARN_ON_ONCE(flush & ~(REQ_FLUSH | REQ_FUA));
+
+ if (WARN_ON_ONCE(!(flush & REQ_FLUSH) && (flush & REQ_FUA)))
+ flush &= ~REQ_FUA;
+
+ q->flush_flags = flush & (REQ_FLUSH | REQ_FUA);
+}
+EXPORT_SYMBOL_GPL(blk_queue_flush);
+
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 47a4127..fa33f97 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,6 @@ static struct brd_device *brd_alloc(int i)
if (!brd->brd_queue)
goto out_free_dev;
blk_queue_make_request(brd->brd_queue, brd_make_request);
- blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c3a4a2e..953d1e1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
lo->lo_queue->unplug_fn = loop_unplug;

if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
- blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(lo->lo_queue, REQ_FLUSH);

set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 2284b4f..72d6246 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdblk_device *osdev)
blk_queue_stack_limits(q, osd_request_queue(osdev->osd));

blk_queue_prep_rq(q, blk_queue_start_tag);
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(q, REQ_FLUSH);

disk->queue = q;

diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
index e9da874..4911f9e 100644
--- a/drivers/block/ps3disk.c
+++ b/drivers/block/ps3disk.c
@@ -468,7 +468,7 @@ static int __devinit ps3disk_probe(struct ps3_system_bus_device *_dev)
blk_queue_dma_alignment(queue, dev->blk_size-1);
blk_queue_logical_block_size(queue, dev->blk_size);

- blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(queue, REQ_FLUSH);

blk_queue_max_segments(queue, -1);
blk_queue_max_segment_size(queue, dev->bounce_size);
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 7965280..d10b635 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -388,22 +388,15 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that
- * to implement write barrier support.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
- } else {
- /*
- * If the FLUSH feature is not supported we must assume that
- * the host does not perform any kind of volatile write
- * caching. We still need to drain the queue to provider
- * proper barrier semantics.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN);
- }
+ /*
+ * If the FLUSH feature is supported we do have support for
+ * flushing a volatile write cache on the host. Use that to
+ * implement write barrier support; otherwise, we must assume
+ * that the host does not perform any kind of volatile write
+ * caching.
+ */
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+ blk_queue_flush(q, REQ_FLUSH);

/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 25ffbf9..1d48f3a 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -95,7 +95,7 @@ struct blkfront_info
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_RING_SIZE];
unsigned long shadow_free;
- int feature_barrier;
+ unsigned int feature_flush;
int is_ready;
};

@@ -418,25 +418,12 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
}


-static int xlvbd_barrier(struct blkfront_info *info)
+static void xlvbd_flush(struct blkfront_info *info)
{
- int err;
- const char *barrier;
-
- switch (info->feature_barrier) {
- case QUEUE_ORDERED_DRAIN: barrier = "enabled"; break;
- case QUEUE_ORDERED_NONE: barrier = "disabled"; break;
- default: return -EINVAL;
- }
-
- err = blk_queue_ordered(info->rq, info->feature_barrier);
-
- if (err)
- return err;
-
+ blk_queue_flush(info->rq, info->feature_flush);
printk(KERN_INFO "blkfront: %s: barriers %s\n",
- info->gd->disk_name, barrier);
- return 0;
+ info->gd->disk_name,
+ info->feature_flush ? "enabled" : "disabled");
}


@@ -515,7 +502,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
info->rq = gd->queue;
info->gd = gd;

- xlvbd_barrier(info);
+ xlvbd_flush(info);

if (vdisk_info & VDISK_READONLY)
set_disk_ro(gd, 1);
@@ -661,8 +648,8 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
info->gd->disk_name);
error = -EOPNOTSUPP;
- info->feature_barrier = QUEUE_ORDERED_NONE;
- xlvbd_barrier(info);
+ info->feature_flush = 0;
+ xlvbd_flush(info);
}
/* fall through */
case BLKIF_OP_READ:
@@ -1075,19 +1062,13 @@ static void blkfront_connect(struct blkfront_info *info)
/*
* If there's no "feature-barrier" defined, then it means
* we're dealing with a very old backend which writes
- * synchronously; draining will do what needs to get done.
+ * synchronously; nothing to do.
*
* If there are barriers, then we use flush.
- *
- * If barriers are not supported, then there's no much we can
- * do, so just set ordering to NONE.
*/
- if (err)
- info->feature_barrier = QUEUE_ORDERED_DRAIN;
- else if (barrier)
- info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
- else
- info->feature_barrier = QUEUE_ORDERED_NONE;
+ info->feature_flush = 0;
+ if (!err && barrier)
+ info->feature_flush = REQ_FLUSH;

err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
if (err) {
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index 7433e07..7c5b01c 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -516,10 +516,10 @@ static int ide_do_setfeature(ide_drive_t *drive, u8 feature, u8 nsect)
return ide_no_data_taskfile(drive, &cmd);
}

-static void update_ordered(ide_drive_t *drive)
+static void update_flush(ide_drive_t *drive)
{
u16 *id = drive->id;
- unsigned ordered = QUEUE_ORDERED_NONE;
+ unsigned flush = 0;

if (drive->dev_flags & IDE_DFLAG_WCACHE) {
unsigned long long capacity;
@@ -543,13 +543,12 @@ static void update_ordered(ide_drive_t *drive)
drive->name, barrier ? "" : "not ");

if (barrier) {
- ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+ flush = REQ_FLUSH;
blk_queue_prep_rq(drive->queue, idedisk_prep_fn);
}
- } else
- ordered = QUEUE_ORDERED_DRAIN;
+ }

- blk_queue_ordered(drive->queue, ordered);
+ blk_queue_flush(drive->queue, flush);
}

ide_devset_get_flag(wcache, IDE_DFLAG_WCACHE);
@@ -572,7 +571,7 @@ static int set_wcache(ide_drive_t *drive, int arg)
}
}

- update_ordered(drive);
+ update_flush(drive);

return err;
}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a3f21dc..b71cc9e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1908,7 +1908,7 @@ static struct mapped_device *alloc_dev(int minor)
blk_queue_softirq_done(md->queue, dm_softirq_done);
blk_queue_prep_rq(md->queue, dm_prep_fn);
blk_queue_lld_busy(md->queue, dm_lld_busy);
- blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(md->queue, REQ_FLUSH);

md->disk = alloc_disk(1);
if (!md->disk)
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index c77eb49..d791772 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -128,7 +128,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
mq->req = NULL;

blk_queue_prep_rq(mq->queue, mmc_prep_request);
- blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);

#ifdef CONFIG_MMC_BLOCK_BOUNCE
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index 1a84fae..29046b7 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -2197,7 +2197,6 @@ static void dasd_setup_queue(struct dasd_block *block)
*/
blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
- blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN);
}

/*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 05a15b0..7f6aca2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2109,7 +2109,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
struct scsi_disk *sdkp = scsi_disk(disk);
struct scsi_device *sdp = sdkp->device;
unsigned char *buffer;
- unsigned ordered;
+ unsigned flush = 0;

SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
"sd_revalidate_disk\n"));
@@ -2151,15 +2151,15 @@ static int sd_revalidate_disk(struct gendisk *disk)

/*
* We now have all cache related info, determine how we deal
- * with ordered requests.
+ * with flush requests.
*/
- if (sdkp->WCE)
- ordered = sdkp->DPOFUA
- ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
- else
- ordered = QUEUE_ORDERED_DRAIN;
+ if (sdkp->WCE) {
+ flush |= REQ_FLUSH;
+ if (sdkp->DPOFUA)
+ flush |= REQ_FUA;
+ }

- blk_queue_ordered(sdkp->disk->queue, ordered);
+ blk_queue_flush(sdkp->disk->queue, flush);

set_capacity(disk, sdkp->capacity);
kfree(buffer);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 96ef5f1..6003f7c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -355,8 +355,10 @@ struct request_queue
struct blk_trace *blk_trace;
#endif
/*
- * reserved for flush operations
+ * for flush operations
*/
+ unsigned int flush_flags;
+
unsigned int ordered, next_ordered, ordseq;
int orderr, ordcolor;
struct request pre_flush_rq, bar_rq, post_flush_rq;
@@ -863,8 +865,8 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
+extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern int blk_queue_ordered(struct request_queue *, unsigned);
extern bool blk_do_ordered(struct request_queue *, struct request **);
extern unsigned blk_ordered_cur_seq(struct request_queue *);
extern unsigned blk_ordered_req_seq(struct request *);
--
1.7.1

2010-08-12 12:44:38

by Tejun Heo

Subject: [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers

Propagate the deprecation of REQ_HARDBARRIER and the new
REQ_FLUSH/FUA interface to upper layers.

* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
WRITE_FLUSH_FUA are added.

* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
copied from bio to request.

* BH_Ordered is marked deprecated and BH_Flush and BH_FUA are added.
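
A hypothetical user of the new buffer flags (no filesystem is
converted by this patch; the helper name below is made up, the rest
is the interface added here plus existing buffer_head API):

#include <linux/buffer_head.h>
#include <linux/fs.h>

/* hypothetical helper: write a commit block with a preceding cache
 * flush and FUA instead of BH_Ordered.  submit_bh() picks the flags
 * up via buffer_flush()/buffer_fua() as added by this patch. */
static void write_commit_block(struct buffer_head *bh)
{
        lock_buffer(bh);
        set_buffer_flush(bh);   /* flush device cache before this write */
        set_buffer_fua(bh);     /* data on non-volatile media at completion */
        get_bh(bh);
        bh->b_end_io = end_buffer_write_sync;
        submit_bh(WRITE_SYNC, bh);
        wait_on_buffer(bh);
}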

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
fs/buffer.c | 27 ++++++++++++++++-----------
include/linux/blk_types.h | 2 +-
include/linux/buffer_head.h | 8 ++++++--
include/linux/fs.h | 20 +++++++++++++-------
4 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d54812b..ec32fbb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3019,18 +3019,23 @@ int submit_bh(int rw, struct buffer_head * bh)
BUG_ON(buffer_delay(bh));
BUG_ON(buffer_unwritten(bh));

- /*
- * Mask in barrier bit for a write (could be either a WRITE or a
- * WRITE_SYNC
- */
- if (buffer_ordered(bh) && (rw & WRITE))
- rw |= WRITE_BARRIER;
+ if (rw & WRITE) {
+ /* ordered is deprecated, will be removed */
+ if (buffer_ordered(bh))
+ rw |= WRITE_BARRIER;

- /*
- * Only clear out a write error when rewriting
- */
- if (test_set_buffer_req(bh) && (rw & WRITE))
- clear_buffer_write_io_error(bh);
+ if (buffer_flush(bh))
+ rw |= WRITE_FLUSH;
+
+ if (buffer_fua(bh))
+ rw |= WRITE_FUA;
+
+ /*
+ * Only clear out a write error when rewriting
+ */
+ if (test_set_buffer_req(bh))
+ clear_buffer_write_io_error(bh);
+ }

/*
* from here on down, it's all bio -- do the initial mapping,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8e9887d..6609fc0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -164,7 +164,7 @@ enum rq_flag_bits {
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
#define REQ_COMMON_MASK \
(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
- REQ_META| REQ_DISCARD | REQ_NOIDLE)
+ REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)

#define REQ_UNPLUG (1 << __REQ_UNPLUG)
#define REQ_RAHEAD (1 << __REQ_RAHEAD)
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 1b9ba19..498bd8b 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -32,8 +32,10 @@ enum bh_state_bits {
BH_Delay, /* Buffer is not yet allocated on disk */
BH_Boundary, /* Block is followed by a discontiguity */
BH_Write_EIO, /* I/O error on write */
- BH_Ordered, /* ordered write */
- BH_Eopnotsupp, /* operation not supported (barrier) */
+ BH_Ordered, /* DEPRECATED: ordered write */
+ BH_Eopnotsupp, /* DEPRECATED: operation not supported (barrier) */
+ BH_Flush, /* Flush device cache before executing IO */
+ BH_FUA, /* Data should be on non-volatile media on completion */
BH_Unwritten, /* Buffer is allocated on disk but not written */
BH_Quiet, /* Buffer Error Prinks to be quiet */

@@ -126,6 +128,8 @@ BUFFER_FNS(Delay, delay)
BUFFER_FNS(Boundary, boundary)
BUFFER_FNS(Write_EIO, write_io_error)
BUFFER_FNS(Ordered, ordered)
+BUFFER_FNS(Flush, flush)
+BUFFER_FNS(FUA, fua)
BUFFER_FNS(Eopnotsupp, eopnotsupp)
BUFFER_FNS(Unwritten, unwritten)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4ebd8eb..6e30b0b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -138,13 +138,13 @@ struct inodes_stat_t {
* SWRITE_SYNC
* SWRITE_SYNC_PLUG Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
* See SWRITE.
- * WRITE_BARRIER Like WRITE_SYNC, but tells the block layer that all
- * previously submitted writes must be safely on storage
- * before this one is started. Also guarantees that when
- * this write is complete, it itself is also safely on
- * storage. Prevents reordering of writes on both sides
- * of this IO.
- *
+ * WRITE_BARRIER DEPRECATED. Always fails. Use FLUSH/FUA instead.
+ * WRITE_FLUSH Like WRITE_SYNC but with preceding cache flush.
+ * WRITE_FUA Like WRITE_SYNC but data is guaranteed to be on
+ * non-volatile media on completion.
+ * WRITE_FLUSH_FUA Combination of WRITE_FLUSH and FUA. The IO is preceded
+ * by a cache flush and data is guaranteed to be on
+ * non-volatile media on completion.
*/
#define RW_MASK REQ_WRITE
#define RWA_MASK REQ_RAHEAD
@@ -162,6 +162,12 @@ struct inodes_stat_t {
#define WRITE_META (WRITE | REQ_META)
#define WRITE_BARRIER (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
REQ_HARDBARRIER)
+#define WRITE_FLUSH (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FLUSH)
+#define WRITE_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FUA)
+#define WRITE_FLUSH_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FLUSH | REQ_FUA)
#define SWRITE_SYNC_PLUG (SWRITE | REQ_SYNC | REQ_NOIDLE)
#define SWRITE_SYNC (SWRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)

--
1.7.1

2010-08-12 12:44:27

by Tejun Heo

Subject: [PATCH 08/11] block: rename barrier/ordered to flush

With the ordering requirements dropped, "barrier" and "ordered" are
misnomers.  All the block layer does now is sequence FLUSH and FUA
requests.  Rename them to flush.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
block/blk-core.c | 21 +++++-----
block/blk-flush.c | 98 +++++++++++++++++++++++------------------------
block/blk.h | 4 +-
include/linux/blkdev.h | 26 ++++++------
4 files changed, 73 insertions(+), 76 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 82bd6d9..efe391b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -136,7 +136,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
{
struct request_queue *q = rq->q;

- if (&q->bar_rq != rq) {
+ if (&q->flush_rq != rq) {
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -160,13 +160,12 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
if (bio->bi_size == 0)
bio_endio(bio, error);
} else {
-
/*
- * Okay, this is the barrier request in progress, just
- * record the error;
+ * Okay, this is the sequenced flush request in
+ * progress, just record the error;
*/
- if (error && !q->orderr)
- q->orderr = error;
+ if (error && !q->flush_err)
+ q->flush_err = error;
}
}

@@ -520,7 +519,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
init_timer(&q->unplug_timer);
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
INIT_LIST_HEAD(&q->timeout_list);
- INIT_LIST_HEAD(&q->pending_barriers);
+ INIT_LIST_HEAD(&q->pending_flushes);
INIT_WORK(&q->unplug_work, blk_unplug_work);

kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1758,11 +1757,11 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
static void blk_account_io_done(struct request *req)
{
/*
- * Account IO completion. bar_rq isn't accounted as a normal
- * IO on queueing nor completion. Accounting the containing
- * request is enough.
+ * Account IO completion. flush_rq isn't accounted as a
+ * normal IO on queueing nor completion. Accounting the
+ * containing request is enough.
*/
- if (blk_do_io_stat(req) && req != &req->q->bar_rq) {
+ if (blk_do_io_stat(req) && req != &req->q->flush_rq) {
unsigned long duration = jiffies - req->start_time;
const int rw = rq_data_dir(req);
struct hd_struct *part;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index e8b2e5c..dd87322 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -9,41 +9,38 @@

#include "blk.h"

-static struct request *queue_next_ordseq(struct request_queue *q);
+static struct request *queue_next_fseq(struct request_queue *q);

-/*
- * Cache flushing for ordered writes handling
- */
-unsigned blk_ordered_cur_seq(struct request_queue *q)
+unsigned blk_flush_cur_seq(struct request_queue *q)
{
- if (!q->ordseq)
+ if (!q->flush_seq)
return 0;
- return 1 << ffz(q->ordseq);
+ return 1 << ffz(q->flush_seq);
}

-static struct request *blk_ordered_complete_seq(struct request_queue *q,
- unsigned seq, int error)
+static struct request *blk_flush_complete_seq(struct request_queue *q,
+ unsigned seq, int error)
{
struct request *next_rq = NULL;

- if (error && !q->orderr)
- q->orderr = error;
+ if (error && !q->flush_err)
+ q->flush_err = error;

- BUG_ON(q->ordseq & seq);
- q->ordseq |= seq;
+ BUG_ON(q->flush_seq & seq);
+ q->flush_seq |= seq;

- if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
- /* not complete yet, queue the next ordered sequence */
- next_rq = queue_next_ordseq(q);
+ if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
+ /* not complete yet, queue the next flush sequence */
+ next_rq = queue_next_fseq(q);
} else {
- /* complete this barrier request */
- __blk_end_request_all(q->orig_bar_rq, q->orderr);
- q->orig_bar_rq = NULL;
- q->ordseq = 0;
-
- /* dispatch the next barrier if there's one */
- if (!list_empty(&q->pending_barriers)) {
- next_rq = list_entry_rq(q->pending_barriers.next);
+ /* complete this flush request */
+ __blk_end_request_all(q->orig_flush_rq, q->flush_err);
+ q->orig_flush_rq = NULL;
+ q->flush_seq = 0;
+
+ /* dispatch the next flush if there's one */
+ if (!list_empty(&q->pending_flushes)) {
+ next_rq = list_entry_rq(q->pending_flushes.next);
list_move(&next_rq->queuelist, &q->queue_head);
}
}
@@ -53,19 +50,19 @@ static struct request *blk_ordered_complete_seq(struct request_queue *q,
static void pre_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
}

-static void bar_end_io(struct request *rq, int error)
+static void flush_data_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
}

static void post_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
}

static void queue_flush(struct request_queue *q, struct request *rq,
@@ -74,34 +71,34 @@ static void queue_flush(struct request_queue *q, struct request *rq,
blk_rq_init(q, rq);
rq->cmd_type = REQ_TYPE_FS;
rq->cmd_flags = REQ_FLUSH;
- rq->rq_disk = q->orig_bar_rq->rq_disk;
+ rq->rq_disk = q->orig_flush_rq->rq_disk;
rq->end_io = end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
}

-static struct request *queue_next_ordseq(struct request_queue *q)
+static struct request *queue_next_fseq(struct request_queue *q)
{
- struct request *rq = &q->bar_rq;
+ struct request *rq = &q->flush_rq;

- switch (blk_ordered_cur_seq(q)) {
- case QUEUE_ORDSEQ_PREFLUSH:
+ switch (blk_flush_cur_seq(q)) {
+ case QUEUE_FSEQ_PREFLUSH:
queue_flush(q, rq, pre_flush_end_io);
break;

- case QUEUE_ORDSEQ_BAR:
+ case QUEUE_FSEQ_DATA:
/* initialize proxy request and queue it */
blk_rq_init(q, rq);
- init_request_from_bio(rq, q->orig_bar_rq->bio);
+ init_request_from_bio(rq, q->orig_flush_rq->bio);
rq->cmd_flags &= ~REQ_HARDBARRIER;
if (q->ordered & QUEUE_ORDERED_DO_FUA)
rq->cmd_flags |= REQ_FUA;
- rq->end_io = bar_end_io;
+ rq->end_io = flush_data_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
break;

- case QUEUE_ORDSEQ_POSTFLUSH:
+ case QUEUE_FSEQ_POSTFLUSH:
queue_flush(q, rq, post_flush_end_io);
break;

@@ -111,19 +108,20 @@ static struct request *queue_next_ordseq(struct request_queue *q)
return rq;
}

-struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
+struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned skip = 0;

if (!(rq->cmd_flags & REQ_HARDBARRIER))
return rq;

- if (q->ordseq) {
+ if (q->flush_seq) {
/*
- * Barrier is already in progress and they can't be
- * processed in parallel. Queue for later processing.
+ * Sequenced flush is already in progress and they
+ * can't be processed in parallel. Queue for later
+ * processing.
*/
- list_move_tail(&rq->queuelist, &q->pending_barriers);
+ list_move_tail(&rq->queuelist, &q->pending_flushes);
return NULL;
}

@@ -138,11 +136,11 @@ struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
}

/*
- * Start a new ordered sequence
+ * Start a new flush sequence
*/
- q->orderr = 0;
+ q->flush_err = 0;
q->ordered = q->next_ordered;
- q->ordseq |= QUEUE_ORDSEQ_STARTED;
+ q->flush_seq |= QUEUE_FSEQ_STARTED;

/*
* For an empty barrier, there's no actual BAR request, which
@@ -154,19 +152,19 @@ struct request *blk_do_ordered(struct request_queue *q, struct request *rq)

/* stash away the original request */
blk_dequeue_request(rq);
- q->orig_bar_rq = rq;
+ q->orig_flush_rq = rq;

if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
- skip |= QUEUE_ORDSEQ_PREFLUSH;
+ skip |= QUEUE_FSEQ_PREFLUSH;

if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
- skip |= QUEUE_ORDSEQ_BAR;
+ skip |= QUEUE_FSEQ_DATA;

if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
- skip |= QUEUE_ORDSEQ_POSTFLUSH;
+ skip |= QUEUE_FSEQ_POSTFLUSH;

/* complete skipped sequences and return the first sequence */
- return blk_ordered_complete_seq(q, skip, 0);
+ return blk_flush_complete_seq(q, skip, 0);
}

static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk.h b/block/blk.h
index 08081e4..24b92bd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,7 +51,7 @@ static inline void blk_clear_rq_complete(struct request *rq)
*/
#define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash))

-struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+struct request *blk_do_flush(struct request_queue *q, struct request *rq);

static inline struct request *__elv_next_request(struct request_queue *q)
{
@@ -60,7 +60,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
while (1) {
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- rq = blk_do_ordered(q, rq);
+ rq = blk_do_flush(q, rq);
if (rq)
return rq;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 522ecda..87e58f0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -357,13 +357,13 @@ struct request_queue
/*
* for flush operations
*/
+ unsigned int ordered, next_ordered;
unsigned int flush_flags;
-
- unsigned int ordered, next_ordered, ordseq;
- int orderr;
- struct request bar_rq;
- struct request *orig_bar_rq;
- struct list_head pending_barriers;
+ unsigned int flush_seq;
+ int flush_err;
+ struct request flush_rq;
+ struct request *orig_flush_rq;
+ struct list_head pending_flushes;

struct mutex sysfs_lock;

@@ -489,13 +489,13 @@ enum {
QUEUE_ORDERED_DO_FUA,

/*
- * Ordered operation sequence
+ * FLUSH/FUA sequences.
*/
- QUEUE_ORDSEQ_STARTED = (1 << 0), /* flushing in progress */
- QUEUE_ORDSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
- QUEUE_ORDSEQ_BAR = (1 << 2), /* barrier write in progress */
- QUEUE_ORDSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
- QUEUE_ORDSEQ_DONE = (1 << 4),
+ QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
+ QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
+ QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
+ QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
+ QUEUE_FSEQ_DONE = (1 << 4),
};

#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
@@ -507,7 +507,7 @@ enum {
#define blk_queue_nonrot(q) test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
-#define blk_queue_flushing(q) ((q)->ordseq)
+#define blk_queue_flushing(q) ((q)->flush_seq)
#define blk_queue_stackable(q) \
test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)
#define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
--
1.7.1

2010-08-12 12:46:15

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG

Nobody is making meaningful use of ORDERED_BY_TAG now and queue
draining for barrier requests will be removed soon, which will render
the advantage of tag ordering moot. Kill ORDERED_BY_TAG. The
following users are affected.

* brd: converted to ORDERED_DRAIN.
* virtio_blk: ORDERED_TAG path was already marked deprecated. Removed.
* xen-blkfront: ORDERED_TAG case dropped.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Chris Wright <[email protected]>
---
block/blk-barrier.c | 35 +++++++----------------------------
drivers/block/brd.c | 2 +-
drivers/block/virtio_blk.c | 9 ---------
drivers/block/xen-blkfront.c | 8 +++-----
drivers/scsi/sd.c | 4 +---
include/linux/blkdev.h | 17 +----------------
6 files changed, 13 insertions(+), 62 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f0faefc..c807e9c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -26,10 +26,7 @@ int blk_queue_ordered(struct request_queue *q, unsigned ordered)
if (ordered != QUEUE_ORDERED_NONE &&
ordered != QUEUE_ORDERED_DRAIN &&
ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
- ordered != QUEUE_ORDERED_DRAIN_FUA &&
- ordered != QUEUE_ORDERED_TAG &&
- ordered != QUEUE_ORDERED_TAG_FLUSH &&
- ordered != QUEUE_ORDERED_TAG_FUA) {
+ ordered != QUEUE_ORDERED_DRAIN_FUA) {
printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
return -EINVAL;
}
@@ -155,21 +152,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
* For an empty barrier, there's no actual BAR request, which
* in turn makes POSTFLUSH unnecessary. Mask them off.
*/
- if (!blk_rq_sectors(rq)) {
+ if (!blk_rq_sectors(rq))
q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
QUEUE_ORDERED_DO_POSTFLUSH);
- /*
- * Empty barrier on a write-through device w/ ordered
- * tag has no command to issue and without any command
- * to issue, ordering by tag can't be used. Drain
- * instead.
- */
- if ((q->ordered & QUEUE_ORDERED_BY_TAG) &&
- !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) {
- q->ordered &= ~QUEUE_ORDERED_BY_TAG;
- q->ordered |= QUEUE_ORDERED_BY_DRAIN;
- }
- }

/* stash away the original request */
blk_dequeue_request(rq);
@@ -210,7 +195,7 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
} else
skip |= QUEUE_ORDSEQ_PREFLUSH;

- if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
+ if (queue_in_flight(q))
rq = NULL;
else
skip |= QUEUE_ORDSEQ_DRAIN;
@@ -257,16 +242,10 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
return true;

- if (q->ordered & QUEUE_ORDERED_BY_TAG) {
- /* Ordered by tag. Blocking the next barrier is enough. */
- if (is_barrier && rq != &q->bar_rq)
- *rqp = NULL;
- } else {
- /* Ordered by draining. Wait for turn. */
- WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
- if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
- *rqp = NULL;
- }
+ /* Ordered by draining. Wait for turn. */
+ WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
+ if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
+ *rqp = NULL;

return true;
}
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 1c7f637..47a4127 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,7 @@ static struct brd_device *brd_alloc(int i)
if (!brd->brd_queue)
goto out_free_dev;
blk_queue_make_request(brd->brd_queue, brd_make_request);
- blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG);
+ blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2aafafc..7965280 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -395,15 +395,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
* to implement write barrier support.
*/
blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
- } else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) {
- /*
- * If the BARRIER feature is supported the host expects us
- * to order request by tags. This implies there is not
- * volatile write cache on the host, and that the host
- * never re-orders outstanding I/O. This feature is not
- * useful for real life scenarious and deprecated.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_TAG);
} else {
/*
* If the FLUSH feature is not supported we must assume that
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 510ab86..25ffbf9 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -424,8 +424,7 @@ static int xlvbd_barrier(struct blkfront_info *info)
const char *barrier;

switch (info->feature_barrier) {
- case QUEUE_ORDERED_DRAIN: barrier = "enabled (drain)"; break;
- case QUEUE_ORDERED_TAG: barrier = "enabled (tag)"; break;
+ case QUEUE_ORDERED_DRAIN: barrier = "enabled"; break;
case QUEUE_ORDERED_NONE: barrier = "disabled"; break;
default: return -EINVAL;
}
@@ -1078,8 +1077,7 @@ static void blkfront_connect(struct blkfront_info *info)
* we're dealing with a very old backend which writes
* synchronously; draining will do what needs to get done.
*
- * If there are barriers, then we can do full queued writes
- * with tagged barriers.
+ * If there are barriers, then we use flush.
*
* If barriers are not supported, then there's no much we can
* do, so just set ordering to NONE.
@@ -1087,7 +1085,7 @@ static void blkfront_connect(struct blkfront_info *info)
if (err)
info->feature_barrier = QUEUE_ORDERED_DRAIN;
else if (barrier)
- info->feature_barrier = QUEUE_ORDERED_TAG;
+ info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
else
info->feature_barrier = QUEUE_ORDERED_NONE;

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8e2e893..05a15b0 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2151,9 +2151,7 @@ static int sd_revalidate_disk(struct gendisk *disk)

/*
* We now have all cache related info, determine how we deal
- * with ordered requests. Note that as the current SCSI
- * dispatch function can alter request order, we cannot use
- * QUEUE_ORDERED_TAG_* even when ordered tag is supported.
+ * with ordered requests.
*/
if (sdkp->WCE)
ordered = sdkp->DPOFUA
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 89c855c..96ef5f1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -469,12 +469,7 @@ enum {
* DRAIN : ordering by draining is enough
* DRAIN_FLUSH : ordering by draining w/ pre and post flushes
* DRAIN_FUA : ordering by draining w/ pre flush and FUA write
- * TAG : ordering by tag is enough
- * TAG_FLUSH : ordering by tag w/ pre and post flushes
- * TAG_FUA : ordering by tag w/ pre flush and FUA write
*/
- QUEUE_ORDERED_BY_DRAIN = 0x01,
- QUEUE_ORDERED_BY_TAG = 0x02,
QUEUE_ORDERED_DO_PREFLUSH = 0x10,
QUEUE_ORDERED_DO_BAR = 0x20,
QUEUE_ORDERED_DO_POSTFLUSH = 0x40,
@@ -482,8 +477,7 @@ enum {

QUEUE_ORDERED_NONE = 0x00,

- QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_BY_DRAIN |
- QUEUE_ORDERED_DO_BAR,
+ QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_DO_BAR,
QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN |
QUEUE_ORDERED_DO_PREFLUSH |
QUEUE_ORDERED_DO_POSTFLUSH,
@@ -491,15 +485,6 @@ enum {
QUEUE_ORDERED_DO_PREFLUSH |
QUEUE_ORDERED_DO_FUA,

- QUEUE_ORDERED_TAG = QUEUE_ORDERED_BY_TAG |
- QUEUE_ORDERED_DO_BAR,
- QUEUE_ORDERED_TAG_FLUSH = QUEUE_ORDERED_TAG |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_POSTFLUSH,
- QUEUE_ORDERED_TAG_FUA = QUEUE_ORDERED_TAG |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_FUA,
-
/*
* Ordered operation sequence
*/
--
1.7.1

2010-08-12 12:44:23

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/11] block: rename blk-barrier.c to blk-flush.c

Without ordering requirements, the terms barrier and ordering are misnomers.
Rename block/blk-barrier.c to block/blk-flush.c. Renaming of symbols
will follow.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
block/Makefile | 2 +-
block/blk-barrier.c | 248 ---------------------------------------------------
block/blk-flush.c | 248 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 249 insertions(+), 249 deletions(-)
delete mode 100644 block/blk-barrier.c
create mode 100644 block/blk-flush.c

diff --git a/block/Makefile b/block/Makefile
index 0bb499a..f627e4b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -3,7 +3,7 @@
#

obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
- blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
+ blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
deleted file mode 100644
index e8b2e5c..0000000
--- a/block/blk-barrier.c
+++ /dev/null
@@ -1,248 +0,0 @@
-/*
- * Functions related to barrier IO handling
- */
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/bio.h>
-#include <linux/blkdev.h>
-#include <linux/gfp.h>
-
-#include "blk.h"
-
-static struct request *queue_next_ordseq(struct request_queue *q);
-
-/*
- * Cache flushing for ordered writes handling
- */
-unsigned blk_ordered_cur_seq(struct request_queue *q)
-{
- if (!q->ordseq)
- return 0;
- return 1 << ffz(q->ordseq);
-}
-
-static struct request *blk_ordered_complete_seq(struct request_queue *q,
- unsigned seq, int error)
-{
- struct request *next_rq = NULL;
-
- if (error && !q->orderr)
- q->orderr = error;
-
- BUG_ON(q->ordseq & seq);
- q->ordseq |= seq;
-
- if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
- /* not complete yet, queue the next ordered sequence */
- next_rq = queue_next_ordseq(q);
- } else {
- /* complete this barrier request */
- __blk_end_request_all(q->orig_bar_rq, q->orderr);
- q->orig_bar_rq = NULL;
- q->ordseq = 0;
-
- /* dispatch the next barrier if there's one */
- if (!list_empty(&q->pending_barriers)) {
- next_rq = list_entry_rq(q->pending_barriers.next);
- list_move(&next_rq->queuelist, &q->queue_head);
- }
- }
- return next_rq;
-}
-
-static void pre_flush_end_io(struct request *rq, int error)
-{
- elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
-}
-
-static void bar_end_io(struct request *rq, int error)
-{
- elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
-}
-
-static void post_flush_end_io(struct request *rq, int error)
-{
- elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
-}
-
-static void queue_flush(struct request_queue *q, struct request *rq,
- rq_end_io_fn *end_io)
-{
- blk_rq_init(q, rq);
- rq->cmd_type = REQ_TYPE_FS;
- rq->cmd_flags = REQ_FLUSH;
- rq->rq_disk = q->orig_bar_rq->rq_disk;
- rq->end_io = end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-}
-
-static struct request *queue_next_ordseq(struct request_queue *q)
-{
- struct request *rq = &q->bar_rq;
-
- switch (blk_ordered_cur_seq(q)) {
- case QUEUE_ORDSEQ_PREFLUSH:
- queue_flush(q, rq, pre_flush_end_io);
- break;
-
- case QUEUE_ORDSEQ_BAR:
- /* initialize proxy request and queue it */
- blk_rq_init(q, rq);
- init_request_from_bio(rq, q->orig_bar_rq->bio);
- rq->cmd_flags &= ~REQ_HARDBARRIER;
- if (q->ordered & QUEUE_ORDERED_DO_FUA)
- rq->cmd_flags |= REQ_FUA;
- rq->end_io = bar_end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
- break;
-
- case QUEUE_ORDSEQ_POSTFLUSH:
- queue_flush(q, rq, post_flush_end_io);
- break;
-
- default:
- BUG();
- }
- return rq;
-}
-
-struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
-{
- unsigned skip = 0;
-
- if (!(rq->cmd_flags & REQ_HARDBARRIER))
- return rq;
-
- if (q->ordseq) {
- /*
- * Barrier is already in progress and they can't be
- * processed in parallel. Queue for later processing.
- */
- list_move_tail(&rq->queuelist, &q->pending_barriers);
- return NULL;
- }
-
- if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
- /*
- * Queue ordering not supported. Terminate
- * with prejudice.
- */
- blk_dequeue_request(rq);
- __blk_end_request_all(rq, -EOPNOTSUPP);
- return NULL;
- }
-
- /*
- * Start a new ordered sequence
- */
- q->orderr = 0;
- q->ordered = q->next_ordered;
- q->ordseq |= QUEUE_ORDSEQ_STARTED;
-
- /*
- * For an empty barrier, there's no actual BAR request, which
- * in turn makes POSTFLUSH unnecessary. Mask them off.
- */
- if (!blk_rq_sectors(rq))
- q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
- QUEUE_ORDERED_DO_POSTFLUSH);
-
- /* stash away the original request */
- blk_dequeue_request(rq);
- q->orig_bar_rq = rq;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
- skip |= QUEUE_ORDSEQ_PREFLUSH;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
- skip |= QUEUE_ORDSEQ_BAR;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
- skip |= QUEUE_ORDSEQ_POSTFLUSH;
-
- /* complete skipped sequences and return the first sequence */
- return blk_ordered_complete_seq(q, skip, 0);
-}
-
-static void bio_end_empty_barrier(struct bio *bio, int err)
-{
- if (err) {
- if (err == -EOPNOTSUPP)
- set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
- clear_bit(BIO_UPTODATE, &bio->bi_flags);
- }
- if (bio->bi_private)
- complete(bio->bi_private);
- bio_put(bio);
-}
-
-/**
- * blkdev_issue_flush - queue a flush
- * @bdev: blockdev to issue flush for
- * @gfp_mask: memory allocation flags (for bio_alloc)
- * @error_sector: error sector
- * @flags: BLKDEV_IFL_* flags to control behaviour
- *
- * Description:
- * Issue a flush for the block device in question. Caller can supply
- * room for storing the error offset in case of a flush error, if they
- * wish to. If WAIT flag is not passed then caller may check only what
- * request was pushed in some internal queue for later handling.
- */
-int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
- sector_t *error_sector, unsigned long flags)
-{
- DECLARE_COMPLETION_ONSTACK(wait);
- struct request_queue *q;
- struct bio *bio;
- int ret = 0;
-
- if (bdev->bd_disk == NULL)
- return -ENXIO;
-
- q = bdev_get_queue(bdev);
- if (!q)
- return -ENXIO;
-
- /*
- * some block devices may not have their queue correctly set up here
- * (e.g. loop device without a backing file) and so issuing a flush
- * here will panic. Ensure there is a request function before issuing
- * the barrier.
- */
- if (!q->make_request_fn)
- return -ENXIO;
-
- bio = bio_alloc(gfp_mask, 0);
- bio->bi_end_io = bio_end_empty_barrier;
- bio->bi_bdev = bdev;
- if (test_bit(BLKDEV_WAIT, &flags))
- bio->bi_private = &wait;
-
- bio_get(bio);
- submit_bio(WRITE_BARRIER, bio);
- if (test_bit(BLKDEV_WAIT, &flags)) {
- wait_for_completion(&wait);
- /*
- * The driver must store the error location in ->bi_sector, if
- * it supports it. For non-stacked drivers, this should be
- * copied from blk_rq_pos(rq).
- */
- if (error_sector)
- *error_sector = bio->bi_sector;
- }
-
- if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
- else if (!bio_flagged(bio, BIO_UPTODATE))
- ret = -EIO;
-
- bio_put(bio);
- return ret;
-}
-EXPORT_SYMBOL(blkdev_issue_flush);
diff --git a/block/blk-flush.c b/block/blk-flush.c
new file mode 100644
index 0000000..e8b2e5c
--- /dev/null
+++ b/block/blk-flush.c
@@ -0,0 +1,248 @@
+/*
+ * Functions related to barrier IO handling
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/gfp.h>
+
+#include "blk.h"
+
+static struct request *queue_next_ordseq(struct request_queue *q);
+
+/*
+ * Cache flushing for ordered writes handling
+ */
+unsigned blk_ordered_cur_seq(struct request_queue *q)
+{
+ if (!q->ordseq)
+ return 0;
+ return 1 << ffz(q->ordseq);
+}
+
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+ unsigned seq, int error)
+{
+ struct request *next_rq = NULL;
+
+ if (error && !q->orderr)
+ q->orderr = error;
+
+ BUG_ON(q->ordseq & seq);
+ q->ordseq |= seq;
+
+ if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+ /* not complete yet, queue the next ordered sequence */
+ next_rq = queue_next_ordseq(q);
+ } else {
+ /* complete this barrier request */
+ __blk_end_request_all(q->orig_bar_rq, q->orderr);
+ q->orig_bar_rq = NULL;
+ q->ordseq = 0;
+
+ /* dispatch the next barrier if there's one */
+ if (!list_empty(&q->pending_barriers)) {
+ next_rq = list_entry_rq(q->pending_barriers.next);
+ list_move(&next_rq->queuelist, &q->queue_head);
+ }
+ }
+ return next_rq;
+}
+
+static void pre_flush_end_io(struct request *rq, int error)
+{
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
+}
+
+static void bar_end_io(struct request *rq, int error)
+{
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
+}
+
+static void post_flush_end_io(struct request *rq, int error)
+{
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
+}
+
+static void queue_flush(struct request_queue *q, struct request *rq,
+ rq_end_io_fn *end_io)
+{
+ blk_rq_init(q, rq);
+ rq->cmd_type = REQ_TYPE_FS;
+ rq->cmd_flags = REQ_FLUSH;
+ rq->rq_disk = q->orig_bar_rq->rq_disk;
+ rq->end_io = end_io;
+
+ elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+}
+
+static struct request *queue_next_ordseq(struct request_queue *q)
+{
+ struct request *rq = &q->bar_rq;
+
+ switch (blk_ordered_cur_seq(q)) {
+ case QUEUE_ORDSEQ_PREFLUSH:
+ queue_flush(q, rq, pre_flush_end_io);
+ break;
+
+ case QUEUE_ORDSEQ_BAR:
+ /* initialize proxy request and queue it */
+ blk_rq_init(q, rq);
+ init_request_from_bio(rq, q->orig_bar_rq->bio);
+ rq->cmd_flags &= ~REQ_HARDBARRIER;
+ if (q->ordered & QUEUE_ORDERED_DO_FUA)
+ rq->cmd_flags |= REQ_FUA;
+ rq->end_io = bar_end_io;
+
+ elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+ break;
+
+ case QUEUE_ORDSEQ_POSTFLUSH:
+ queue_flush(q, rq, post_flush_end_io);
+ break;
+
+ default:
+ BUG();
+ }
+ return rq;
+}
+
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
+{
+ unsigned skip = 0;
+
+ if (!(rq->cmd_flags & REQ_HARDBARRIER))
+ return rq;
+
+ if (q->ordseq) {
+ /*
+ * Barrier is already in progress and they can't be
+ * processed in parallel. Queue for later processing.
+ */
+ list_move_tail(&rq->queuelist, &q->pending_barriers);
+ return NULL;
+ }
+
+ if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
+ /*
+ * Queue ordering not supported. Terminate
+ * with prejudice.
+ */
+ blk_dequeue_request(rq);
+ __blk_end_request_all(rq, -EOPNOTSUPP);
+ return NULL;
+ }
+
+ /*
+ * Start a new ordered sequence
+ */
+ q->orderr = 0;
+ q->ordered = q->next_ordered;
+ q->ordseq |= QUEUE_ORDSEQ_STARTED;
+
+ /*
+ * For an empty barrier, there's no actual BAR request, which
+ * in turn makes POSTFLUSH unnecessary. Mask them off.
+ */
+ if (!blk_rq_sectors(rq))
+ q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
+ QUEUE_ORDERED_DO_POSTFLUSH);
+
+ /* stash away the original request */
+ blk_dequeue_request(rq);
+ q->orig_bar_rq = rq;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+ skip |= QUEUE_ORDSEQ_PREFLUSH;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+ skip |= QUEUE_ORDSEQ_BAR;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+ skip |= QUEUE_ORDSEQ_POSTFLUSH;
+
+ /* complete skipped sequences and return the first sequence */
+ return blk_ordered_complete_seq(q, skip, 0);
+}
+
+static void bio_end_empty_barrier(struct bio *bio, int err)
+{
+ if (err) {
+ if (err == -EOPNOTSUPP)
+ set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ }
+ if (bio->bi_private)
+ complete(bio->bi_private);
+ bio_put(bio);
+}
+
+/**
+ * blkdev_issue_flush - queue a flush
+ * @bdev: blockdev to issue flush for
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ * @error_sector: error sector
+ * @flags: BLKDEV_IFL_* flags to control behaviour
+ *
+ * Description:
+ * Issue a flush for the block device in question. Caller can supply
+ * room for storing the error offset in case of a flush error, if they
+ * wish to. If WAIT flag is not passed then caller may check only what
+ * request was pushed in some internal queue for later handling.
+ */
+int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
+ sector_t *error_sector, unsigned long flags)
+{
+ DECLARE_COMPLETION_ONSTACK(wait);
+ struct request_queue *q;
+ struct bio *bio;
+ int ret = 0;
+
+ if (bdev->bd_disk == NULL)
+ return -ENXIO;
+
+ q = bdev_get_queue(bdev);
+ if (!q)
+ return -ENXIO;
+
+ /*
+ * some block devices may not have their queue correctly set up here
+ * (e.g. loop device without a backing file) and so issuing a flush
+ * here will panic. Ensure there is a request function before issuing
+ * the barrier.
+ */
+ if (!q->make_request_fn)
+ return -ENXIO;
+
+ bio = bio_alloc(gfp_mask, 0);
+ bio->bi_end_io = bio_end_empty_barrier;
+ bio->bi_bdev = bdev;
+ if (test_bit(BLKDEV_WAIT, &flags))
+ bio->bi_private = &wait;
+
+ bio_get(bio);
+ submit_bio(WRITE_BARRIER, bio);
+ if (test_bit(BLKDEV_WAIT, &flags)) {
+ wait_for_completion(&wait);
+ /*
+ * The driver must store the error location in ->bi_sector, if
+ * it supports it. For non-stacked drivers, this should be
+ * copied from blk_rq_pos(rq).
+ */
+ if (error_sector)
+ *error_sector = bio->bi_sector;
+ }
+
+ if (bio_flagged(bio, BIO_EOPNOTSUPP))
+ ret = -EOPNOTSUPP;
+ else if (!bio_flagged(bio, BIO_UPTODATE))
+ ret = -EIO;
+
+ bio_put(bio);
+ return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_flush);
--
1.7.1

2010-08-12 12:47:13

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 01/11] block/loop: queue ordered mode should be DRAIN_FLUSH

loop implements FLUSH using fsync but was incorrectly setting its
ordered mode to DRAIN. Change it to DRAIN_FLUSH. In practice, this
doesn't change anything as loop doesn't make use of the block layer
ordered implementation.

Signed-off-by: Tejun Heo <[email protected]>
---
drivers/block/loop.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f3c636d..c3a4a2e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
lo->lo_queue->unplug_fn = loop_unplug;

if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
- blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN);
+ blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);

set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
--
1.7.1

2010-08-12 12:47:19

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/11] block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests

Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags. REQ_FLUSH means the
device cache should be flushed before executing the request. REQ_FUA
means that the data in the request should be on non-volatile media on
completion.

The block layer will choose the correct way of implementing the semantics
and execute it. The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests. Devices will never see REQ_FLUSH and/or REQ_FUA
flags that they don't support.

* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
blk-flush.c.

* REQ_FLUSH w/o data could also be passed directly to drivers without
  sequencing, but some drivers assume that zero length requests don't
  have rq->bio, which isn't true for these requests; they therefore
  still require the use of proxy requests.
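
To illustrate the interface described above, here is a minimal sketch of
the driver side. blk_queue_flush() and the REQ_FLUSH/REQ_FUA flags come
from this series; the my_* helpers, the function names and the omitted
completion/error handling are made up purely for this example.

/*
 * Illustrative sketch only, not part of the patchset.  A driver with a
 * volatile write cache and native FUA support declares both and then
 * honors the two flags on the requests it receives.
 */
static void my_setup_queue(struct request_queue *q)
{
	/* we can flush our cache and do FUA writes natively */
	blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
}

static void my_request_fn(struct request_queue *q)
{
	struct request *rq;

	while ((rq = blk_fetch_request(q)) != NULL) {
		/* REQ_FLUSH: write out the volatile cache before this request */
		if (rq->cmd_flags & REQ_FLUSH)
			my_flush_cache();

		/* REQ_FUA: data must be on non-volatile media on completion */
		if (blk_rq_sectors(rq))
			my_issue_rw(rq, rq->cmd_flags & REQ_FUA);

		/* completion via __blk_end_request_all() etc. omitted */
	}
}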

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-flush.c | 85 ++++++++++++++++++++++++++----------------------
block/blk.h | 3 ++
include/linux/blkdev.h | 38 +--------------------
4 files changed, 52 insertions(+), 76 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index efe391b..c00ace2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1204,7 +1204,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)

spin_lock_irq(q->queue_lock);

- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
where = ELEVATOR_INSERT_FRONT;
goto get_rq;
}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index dd87322..452c552 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,5 +1,5 @@
/*
- * Functions related to barrier IO handling
+ * Functions to sequence FLUSH and FUA writes.
*/
#include <linux/kernel.h>
#include <linux/module.h>
@@ -9,6 +9,15 @@

#include "blk.h"

+/* FLUSH/FUA sequences */
+enum {
+ QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
+ QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
+ QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
+ QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
+ QUEUE_FSEQ_DONE = (1 << 4),
+};
+
static struct request *queue_next_fseq(struct request_queue *q);

unsigned blk_flush_cur_seq(struct request_queue *q)
@@ -79,6 +88,7 @@ static void queue_flush(struct request_queue *q, struct request *rq,

static struct request *queue_next_fseq(struct request_queue *q)
{
+ struct request *orig_rq = q->orig_flush_rq;
struct request *rq = &q->flush_rq;

switch (blk_flush_cur_seq(q)) {
@@ -87,12 +97,11 @@ static struct request *queue_next_fseq(struct request_queue *q)
break;

case QUEUE_FSEQ_DATA:
- /* initialize proxy request and queue it */
+ /* initialize proxy request, inherit FLUSH/FUA and queue it */
blk_rq_init(q, rq);
- init_request_from_bio(rq, q->orig_flush_rq->bio);
- rq->cmd_flags &= ~REQ_HARDBARRIER;
- if (q->ordered & QUEUE_ORDERED_DO_FUA)
- rq->cmd_flags |= REQ_FUA;
+ init_request_from_bio(rq, orig_rq->bio);
+ rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
+ rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
rq->end_io = flush_data_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -110,60 +119,58 @@ static struct request *queue_next_fseq(struct request_queue *q)

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
+ unsigned int fflags = q->flush_flags; /* may change, cache it */
+ bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
+ bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
+ bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
unsigned skip = 0;

- if (!(rq->cmd_flags & REQ_HARDBARRIER))
+ /*
+ * Special case. If there's data but flush is not necessary,
+ * the request can be issued directly.
+ *
+ * Flush w/o data should be able to be issued directly too but
+ * currently some drivers assume that rq->bio contains
+ * non-zero data if it isn't NULL and empty FLUSH requests
+ * getting here usually have bio's without data.
+ */
+ if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
+ rq->cmd_flags &= ~REQ_FLUSH;
+ if (!has_fua)
+ rq->cmd_flags &= ~REQ_FUA;
return rq;
+ }

+ /*
+ * Sequenced flushes can't be processed in parallel. If
+ * another one is already in progress, queue for later
+ * processing.
+ */
if (q->flush_seq) {
- /*
- * Sequenced flush is already in progress and they
- * can't be processed in parallel. Queue for later
- * processing.
- */
list_move_tail(&rq->queuelist, &q->pending_flushes);
return NULL;
}

- if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
- /*
- * Queue ordering not supported. Terminate
- * with prejudice.
- */
- blk_dequeue_request(rq);
- __blk_end_request_all(rq, -EOPNOTSUPP);
- return NULL;
- }
-
/*
* Start a new flush sequence
*/
q->flush_err = 0;
- q->ordered = q->next_ordered;
q->flush_seq |= QUEUE_FSEQ_STARTED;

- /*
- * For an empty barrier, there's no actual BAR request, which
- * in turn makes POSTFLUSH unnecessary. Mask them off.
- */
- if (!blk_rq_sectors(rq))
- q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
- QUEUE_ORDERED_DO_POSTFLUSH);
-
- /* stash away the original request */
+ /* adjust FLUSH/FUA of the original request and stash it away */
+ rq->cmd_flags &= ~REQ_FLUSH;
+ if (!has_fua)
+ rq->cmd_flags &= ~REQ_FUA;
blk_dequeue_request(rq);
q->orig_flush_rq = rq;

- if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+ /* skip unneded sequences and return the first one */
+ if (!do_preflush)
skip |= QUEUE_FSEQ_PREFLUSH;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+ if (!blk_rq_sectors(rq))
skip |= QUEUE_FSEQ_DATA;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+ if (!do_postflush)
skip |= QUEUE_FSEQ_POSTFLUSH;
-
- /* complete skipped sequences and return the first sequence */
return blk_flush_complete_seq(q, skip, 0);
}

diff --git a/block/blk.h b/block/blk.h
index 24b92bd..a09c18b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -60,6 +60,9 @@ static inline struct request *__elv_next_request(struct request_queue *q)
while (1) {
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
+ if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
+ rq == &q->flush_rq)
+ return rq;
rq = blk_do_flush(q, rq);
if (rq)
return rq;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 87e58f0..5ce0696 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -357,7 +357,6 @@ struct request_queue
/*
* for flush operations
*/
- unsigned int ordered, next_ordered;
unsigned int flush_flags;
unsigned int flush_seq;
int flush_err;
@@ -464,40 +463,6 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
__clear_bit(flag, &q->queue_flags);
}

-enum {
- /*
- * Hardbarrier is supported with one of the following methods.
- *
- * NONE : hardbarrier unsupported
- * DRAIN : ordering by draining is enough
- * DRAIN_FLUSH : ordering by draining w/ pre and post flushes
- * DRAIN_FUA : ordering by draining w/ pre flush and FUA write
- */
- QUEUE_ORDERED_DO_PREFLUSH = 0x10,
- QUEUE_ORDERED_DO_BAR = 0x20,
- QUEUE_ORDERED_DO_POSTFLUSH = 0x40,
- QUEUE_ORDERED_DO_FUA = 0x80,
-
- QUEUE_ORDERED_NONE = 0x00,
-
- QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_DO_BAR,
- QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_POSTFLUSH,
- QUEUE_ORDERED_DRAIN_FUA = QUEUE_ORDERED_DRAIN |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_FUA,
-
- /*
- * FLUSH/FUA sequences.
- */
- QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
- QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
- QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
- QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
- QUEUE_FSEQ_DONE = (1 << 4),
-};
-
#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
#define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
@@ -576,7 +541,8 @@ static inline void blk_clear_queue_full(struct request_queue *q, int sync)
* it already be started by driver.
*/
#define RQ_NOMERGE_FLAGS \
- (REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+ (REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | \
+ REQ_FLUSH | REQ_FUA)
#define rq_mergeable(rq) \
(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
(((rq)->cmd_flags & REQ_DISCARD) || \
--
1.7.1

2010-08-12 12:47:08

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 05/11] block: misc cleanups in barrier code

Make the following cleanups in preparation for the barrier/flush update.

* blk_do_ordered() declaration is moved from include/linux/blkdev.h to
block/blk.h.

* blk_do_ordered() now returns a pointer to struct request, with %NULL
  meaning "try the next request" and ERR_PTR(-EAGAIN) meaning "try
  again later". The third case will be dropped with further changes.

* In the initialization of proxy barrier request, data direction is
already set by init_request_from_bio(). Drop unnecessary explicit
REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
flag setting.

* add_request() is collapsed into __make_request().

These changes don't make any functional difference.
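
As a minimal sketch of the new calling convention (the real caller is
the __elv_next_request() hunk in block/blk.h below; the function name
here is made up for illustration):

/* Sketch only: how a caller interprets blk_do_ordered()'s return value. */
static struct request *example_next_request(struct request_queue *q,
					    struct request *rq)
{
	rq = blk_do_ordered(q, rq);
	if (rq == NULL)
		return NULL;	/* consumed: queued internally or failed */
	if (IS_ERR(rq))
		return NULL;	/* ERR_PTR(-EAGAIN): not its turn, retry later */
	return rq;		/* dispatch this request */
}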

Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-barrier.c | 32 ++++++++++++++------------------
block/blk-core.c | 21 ++++-----------------
block/blk.h | 7 +++++--
include/linux/blkdev.h | 1 -
4 files changed, 23 insertions(+), 38 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index ed0aba5..f1be85b 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -110,9 +110,9 @@ static void queue_flush(struct request_queue *q, unsigned which)
elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
}

-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+static inline struct request *start_ordered(struct request_queue *q,
+ struct request *rq)
{
- struct request *rq = *rqp;
unsigned skip = 0;

q->orderr = 0;
@@ -149,11 +149,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)

/* initialize proxy request and queue it */
blk_rq_init(q, rq);
- if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
- rq->cmd_flags |= REQ_WRITE;
+ init_request_from_bio(rq, q->orig_bar_rq->bio);
if (q->ordered & QUEUE_ORDERED_DO_FUA)
rq->cmd_flags |= REQ_FUA;
- init_request_from_bio(rq, q->orig_bar_rq->bio);
rq->end_io = bar_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -171,27 +169,26 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
else
skip |= QUEUE_ORDSEQ_DRAIN;

- *rqp = rq;
-
/*
* Complete skipped sequences. If whole sequence is complete,
- * return false to tell elevator that this request is gone.
+ * return %NULL to tell elevator that this request is gone.
*/
- return !blk_ordered_complete_seq(q, skip, 0);
+ if (blk_ordered_complete_seq(q, skip, 0))
+ rq = NULL;
+ return rq;
}

-bool blk_do_ordered(struct request_queue *q, struct request **rqp)
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
{
- struct request *rq = *rqp;
const int is_barrier = rq->cmd_type == REQ_TYPE_FS &&
(rq->cmd_flags & REQ_HARDBARRIER);

if (!q->ordseq) {
if (!is_barrier)
- return true;
+ return rq;

if (q->next_ordered != QUEUE_ORDERED_NONE)
- return start_ordered(q, rqp);
+ return start_ordered(q, rq);
else {
/*
* Queue ordering not supported. Terminate
@@ -199,8 +196,7 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
*/
blk_dequeue_request(rq);
__blk_end_request_all(rq, -EOPNOTSUPP);
- *rqp = NULL;
- return false;
+ return NULL;
}
}

@@ -211,14 +207,14 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
/* Special requests are not subject to ordering rules. */
if (rq->cmd_type != REQ_TYPE_FS &&
rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
- return true;
+ return rq;

/* Ordered by draining. Wait for turn. */
WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
- *rqp = NULL;
+ rq = ERR_PTR(-EAGAIN);

- return true;
+ return rq;
}

static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk-core.c b/block/blk-core.c
index 3f802dd..ed8ef89 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1037,22 +1037,6 @@ void blk_insert_request(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL(blk_insert_request);

-/*
- * add-request adds a request to the linked list.
- * queue lock is held and interrupts disabled, as we muck with the
- * request queue list.
- */
-static inline void add_request(struct request_queue *q, struct request *req)
-{
- drive_stat_acct(req, 1);
-
- /*
- * elevator indicated where it wants this request to be
- * inserted at elevator_merge time
- */
- __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
-}
-
static void part_round_stats_single(int cpu, struct hd_struct *part,
unsigned long now)
{
@@ -1316,7 +1300,10 @@ get_rq:
req->cpu = blk_cpu_to_group(smp_processor_id());
if (queue_should_plug(q) && elv_queue_empty(q))
blk_plug_device(q);
- add_request(q, req);
+
+ /* insert the request into the elevator */
+ drive_stat_acct(req, 1);
+ __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
out:
if (unplug || !queue_should_plug(q))
__generic_unplug_device(q);
diff --git a/block/blk.h b/block/blk.h
index 6e7dc87..874eb4e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete(struct request *rq)
*/
#define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash))

+struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
@@ -58,8 +60,9 @@ static inline struct request *__elv_next_request(struct request_queue *q)
while (1) {
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (blk_do_ordered(q, &rq))
- return rq;
+ rq = blk_do_ordered(q, rq);
+ if (rq)
+ return !IS_ERR(rq) ? rq : NULL;
}

if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6003f7c..21baa19 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -867,7 +867,6 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern bool blk_do_ordered(struct request_queue *, struct request **);
extern unsigned blk_ordered_cur_seq(struct request_queue *);
extern unsigned blk_ordered_req_seq(struct request *);
extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);
--
1.7.1

2010-08-12 21:24:46

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers

On Thu 12-08-10 14:41:30, Tejun Heo wrote:
> Propagate deprecation of REQ_HARDBARRIER and new REQ_FLUSH/FUA
> interface to upper layers.
>
> * WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
> WRITE_FLUSH_FUA are added.
>
> * REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
> copied from bio to request.
>
> * BH_Ordered is marked deprecated and BH_Flush and BH_FUA are added.
Deprecating BH_Ordered is fine but I wouldn't introduce new BH flags for
this. BH flags should be used for buffer state, not for encoding how the
buffer should be written (there were actually bugs in the past because of
this). Being able to set proper flags when calling submit_bh() in the rw
parameter is enough.
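
As a rough illustration of the alternative being suggested, a sketch
only: it assumes <linux/buffer_head.h>, the stock end_buffer_write_sync()
helper, and the WRITE_FLUSH_FUA macro added by patch 10; the function
name is made up.

/* Sketch: ask for "flush the cache, then FUA write this block" purely
 * through the @rw argument of submit_bh(), with no new BH state bits. */
static void example_write_commit_block(struct buffer_head *bh)
{
	lock_buffer(bh);
	get_bh(bh);
	bh->b_end_io = end_buffer_write_sync;
	submit_bh(WRITE_FLUSH_FUA, bh);
}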

Honza

>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> ---
> fs/buffer.c | 27 ++++++++++++++++-----------
> include/linux/blk_types.h | 2 +-
> include/linux/buffer_head.h | 8 ++++++--
> include/linux/fs.h | 20 +++++++++++++-------
> 4 files changed, 36 insertions(+), 21 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d54812b..ec32fbb 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3019,18 +3019,23 @@ int submit_bh(int rw, struct buffer_head * bh)
> BUG_ON(buffer_delay(bh));
> BUG_ON(buffer_unwritten(bh));
>
> - /*
> - * Mask in barrier bit for a write (could be either a WRITE or a
> - * WRITE_SYNC
> - */
> - if (buffer_ordered(bh) && (rw & WRITE))
> - rw |= WRITE_BARRIER;
> + if (rw & WRITE) {
> + /* ordered is deprecated, will be removed */
> + if (buffer_ordered(bh))
> + rw |= WRITE_BARRIER;
>
> - /*
> - * Only clear out a write error when rewriting
> - */
> - if (test_set_buffer_req(bh) && (rw & WRITE))
> - clear_buffer_write_io_error(bh);
> + if (buffer_flush(bh))
> + rw |= WRITE_FLUSH;
> +
> + if (buffer_fua(bh))
> + rw |= WRITE_FUA;
> +
> + /*
> + * Only clear out a write error when rewriting
> + */
> + if (test_set_buffer_req(bh))
> + clear_buffer_write_io_error(bh);
> + }
>
> /*
> * from here on down, it's all bio -- do the initial mapping,
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 8e9887d..6609fc0 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -164,7 +164,7 @@ enum rq_flag_bits {
> (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> #define REQ_COMMON_MASK \
> (REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
> - REQ_META| REQ_DISCARD | REQ_NOIDLE)
> + REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
>
> #define REQ_UNPLUG (1 << __REQ_UNPLUG)
> #define REQ_RAHEAD (1 << __REQ_RAHEAD)
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 1b9ba19..498bd8b 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -32,8 +32,10 @@ enum bh_state_bits {
> BH_Delay, /* Buffer is not yet allocated on disk */
> BH_Boundary, /* Block is followed by a discontiguity */
> BH_Write_EIO, /* I/O error on write */
> - BH_Ordered, /* ordered write */
> - BH_Eopnotsupp, /* operation not supported (barrier) */
> + BH_Ordered, /* DEPRECATED: ordered write */
> + BH_Eopnotsupp, /* DEPRECATED: operation not supported (barrier) */
> + BH_Flush, /* Flush device cache before executing IO */
> + BH_FUA, /* Data should be on non-volatile media on completion */
> BH_Unwritten, /* Buffer is allocated on disk but not written */
> BH_Quiet, /* Buffer Error Prinks to be quiet */
>
> @@ -126,6 +128,8 @@ BUFFER_FNS(Delay, delay)
> BUFFER_FNS(Boundary, boundary)
> BUFFER_FNS(Write_EIO, write_io_error)
> BUFFER_FNS(Ordered, ordered)
> +BUFFER_FNS(Flush, flush)
> +BUFFER_FNS(FUA, fua)
> BUFFER_FNS(Eopnotsupp, eopnotsupp)
> BUFFER_FNS(Unwritten, unwritten)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4ebd8eb..6e30b0b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -138,13 +138,13 @@ struct inodes_stat_t {
> * SWRITE_SYNC
> * SWRITE_SYNC_PLUG Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
> * See SWRITE.
> - * WRITE_BARRIER Like WRITE_SYNC, but tells the block layer that all
> - * previously submitted writes must be safely on storage
> - * before this one is started. Also guarantees that when
> - * this write is complete, it itself is also safely on
> - * storage. Prevents reordering of writes on both sides
> - * of this IO.
> - *
> + * WRITE_BARRIER DEPRECATED. Always fails. Use FLUSH/FUA instead.
> + * WRITE_FLUSH Like WRITE_SYNC but with preceding cache flush.
> + * WRITE_FUA Like WRITE_SYNC but data is guaranteed to be on
> + * non-volatile media on completion.
> + * WRITE_FLUSH_FUA Combination of WRITE_FLUSH and FUA. The IO is preceded
> + * by a cache flush and data is guaranteed to be on
> + * non-volatile media on completion.
> */
> #define RW_MASK REQ_WRITE
> #define RWA_MASK REQ_RAHEAD
> @@ -162,6 +162,12 @@ struct inodes_stat_t {
> #define WRITE_META (WRITE | REQ_META)
> #define WRITE_BARRIER (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
> REQ_HARDBARRIER)
> +#define WRITE_FLUSH (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
> + REQ_FLUSH)
> +#define WRITE_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
> + REQ_FUA)
> +#define WRITE_FLUSH_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
> + REQ_FLUSH | REQ_FUA)
> #define SWRITE_SYNC_PLUG (SWRITE | REQ_SYNC | REQ_NOIDLE)
> #define SWRITE_SYNC (SWRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)
>
> --
> 1.7.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-08-13 07:22:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers

Hello, Jan.

On 08/12/2010 11:24 PM, Jan Kara wrote:
> On Thu 12-08-10 14:41:30, Tejun Heo wrote:
>> Propagate deprecation of REQ_HARDBARRIER and new REQ_FLUSH/FUA
>> interface to upper layers.
>>
>> * WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
>> WRITE_FLUSH_FUA are added.
>>
>> * REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
>> copied from bio to request.
>>
>> * BH_Ordered is marked deprecated and BH_Flush and BH_FUA are added.
>
> Deprecating BH_Ordered is fine but I wouldn't introduce new BH flags for
> this. BH flags should be used for buffer state, not for encoding how the
> buffer should be written (there were actually bugs in the past because of
> this). Being able to set proper flags when calling submit_bh() in the rw
> parameter is enough.

Ah, okay, I was just trying to match the BH_Ordered usage but you're
saying just requiring submit_bh() users to specify appropriate REQ_*
(or WRITE_*) in @rw is okay, right? I'll drop the bh part then.

Thanks.

--
tejun

2010-08-13 07:47:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers

FYI: I've already sent a patch to kill BH_Ordered, hopefully Al will
still push it in this merge window.

2010-08-13 11:49:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

The patchset looks functionally correct to me, and with a small patch
to make use of WRITE_FUA_FLUSH survives xfstests, and instrumenting the
underlying qemu shows that we actually get the flush requests where we should.

No performance or power fail testing done yet.

But I do not like the transition very much. The new WRITE_FUA_FLUSH
request is exactly what filesystems expect from a current barrier
request, so I'd rather move to that functionality without breaking stuff
inbetween.

So if it were up to me I'd keep patches 1, 2, 4 and 5 from your series, then
a main one to relax barrier semantics, then have the renaming patches 7
and 8, and possibly keep patch 11 separate from the main implementation
change, and if absolutely necessary also a separate one to introduce
REQ_FUA and REQ_FLUSH in the bio interface, but keep things working
while doing this.

Then we can add patches to disable the reiserfs barrier "optimization" as
the very first one, and DM/MD support, which I'm currently working on,
as the last one, and we can start doing the heavy testing.

2010-08-13 12:56:08

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Tejun Heo, on 08/12/2010 04:41 PM wrote:
> Each filesystem needs to be updated to enforce request
> ordering themselves and then to use REQ_FLUSH/FUA mechanism.

I generally agree with the patchset, but I believe this particular
change is a really bad move.

I'm not even mentioning the obvious point that common functionality
(enforcing request ordering in this case) should be handled by a common
library, not internally by each of the zillion file systems Linux has.

The worst part of this move is that it would hide all the request
ordering semantics inside file systems in, most likely, a very unclear
way. That would mean that if I or someone else decided to implement
"hardware offload" of request ordering (ORDERED requests), we would not
be able to see any improvement until at least one file system is
changed to be able to use it. Worse, if the implementor can't
demonstrate the improvement, how can he encourage file system
developers to update their file systems? Which, basically, would mean
that only a person with *BOTH* deep storage and file system internals
knowledge could do the job. How many such people do you know? Both
storage and file systems are very wide and tricky topics, so people
nearly always specialize in one of them, not both.

Thus, this move would basically mean that the proper ordered queuing
would probably never be implemented in Linux.

I believe a much better approach would be to create a common interface
which file systems would use to enforce request ordering when they need it.

Advantages of this approach:

1. The ordering requirements of file systems would be clear.

2. They would be handled in one place by a common code.

3. Any storage level expert can try to implement ordered queuing without
a deep dive into file systems design and implementation.

I already suggested such interface in
http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Internally for the
moment it can be implemented using existing REQ_FLUSH/FUA/etc. and
waiting for all the requests in the group to finish. As a nice side
effect, if a device doesn't support FUA, it would be possible to issue
SYNC_CACHE command(s) only for required blocks, not for the whole device
as it is done now.

If requested, I can develop the interface further.

Vlad

2010-08-13 13:02:27

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG

Hello Tejun,

Tejun Heo, on 08/12/2010 04:41 PM wrote:
> Nobody is making meaningful use of ORDERED_BY_TAG now and queue
> draining for barrier requests will be removed soon which will render
> the advantage of tag ordering moot.

Have you seen Hannes Reinecke's and my measurements in
http://marc.info/?l=linux-scsi&m=128110662528485&w=2 and
http://marc.info/?l=linux-scsi&m=128111995217405&w=2 correspondingly?

If yes, what else evidences do you need to see that the tag ordering is
a big performance win?

Vlad

2010-08-13 13:07:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG

On Fri, Aug 13, 2010 at 04:56:32PM +0400, Vladislav Bolkhovitin wrote:
> Tejun Heo, on 08/12/2010 04:41 PM wrote:
> >Nobody is making meaningful use of ORDERED_BY_TAG now and queue
> >draining for barrier requests will be removed soon which will render
> >the advantage of tag ordering moot.
>
> Have you seen Hannes Reinecke's and my measurements in
> http://marc.info/?l=linux-scsi&m=128110662528485&w=2 and
> http://marc.info/?l=linux-scsi&m=128111995217405&w=2 correspondingly?
>
> If yes, what else evidences do you need to see that the tag ordering is
> a big performance win?

It's not tag ordering that is a win but big queue depth. That's what you
measured and what I fully agree with. I haven't been able to get out of
Hannes what he actually measured.

And if you actually look at the patchset, allowing deep queues is
exactly what it gives us; while I haven't done testing on this
patchset, only on my previous version, it does get us back to using
the full potential of large arrays exactly because of that.

2010-08-13 13:17:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 13, 2010 at 04:55:33PM +0400, Vladislav Bolkhovitin wrote:
> I'm not mentioning the obvious that a common functionality (enforcing
> requests ordering in this case) should be handled by a common library,
> but not internally by a zillion file systems Linux has.

I/O ordering is still handled mostly by common code, that is the
pagecache and the buffercache, although a few filesystems like XFS and
btrfs have their own implementation of the second one.

The current ordered semantics of barriers have only ever been
successfully implemented by a complete queue drain, and have not been
used effectively by filesystems. This patchset removes the bogus global
ordering enforced by the block layer whenever a filesystem wants to be
able to use cache flushes, and because of that allows a deeper
outstanding I/O queue depth with less latency.

Now I know you in particular are a fan of scsi ordered tags. And as I
told you before, I'm open to reviewing such an implementation if it
shows us any advantages. Adding it after this patch is in fact not any
more complicated than before; I'd almost be tempted to say it's easier,
as you don't have to plug it into the complex state machine we used for
barriers, and more importantly we drop the requirement for the barrier
sequence to be atomic, which in fact made implementing barriers using
tagged queues impossible with the current scsi layer.

As far as playing with ordered tags it's just adding a new flag for
it on the bio that gets passed down to the driver. For a final version
you'd need a queue-level feature if it's supported, but you don't
even need that for the initial work. Then you can implement a
variant of blk_do_flush that does away with queueing additional requests
as each one finishes and instead queues all two or three at the same
time with your new ordered flag set, at which point you are back to the
level of ordered tag usage that the old code allowed. You're still left with
all the hard problems of actually implementing error handling for it
and using it higher up in the filesystem and generic page cache code.
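
For what it's worth, a purely hypothetical sketch of that experiment
might look like the following. REQ_ORDERED is a made-up name standing in
for the "new flag on the bio" described above; neither the flag, its
value, nor this helper exists in the tree or in this patchset.

/* Hypothetical illustration only. */
#define REQ_ORDERED	(1 << 30)	/* made-up value for this sketch */

static void example_submit_ordered_commit(struct bio *bio)
{
	/* cache flush + FUA as today, plus device-enforced ordering
	 * against already queued I/O instead of draining the queue */
	submit_bio(WRITE_FLUSH_FUA | REQ_ORDERED, bio);
}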

I'd really love to see your results, up to the point of just trying
that once I get a little spare time. But my theory is that it won't
help us - the problem with ordered tags is that they enforce global
ordering while we currently have local ordering. While it will reduce
the latency for the process waiting for an fsync or similar it will
affect other I/O going on in the background and reduce the devices
ability to reorder that I/O.

So for now this patch set is a massive performance improvement for
workloads we care about, while removing an interface we put in place
to allow a theoretical optimization that hasn't materialized in 8
years, and that in fact made the interface just complicated enough to
make that optimization so hard to do.

2010-08-13 13:24:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/13/2010 02:55 PM, Vladislav Bolkhovitin wrote:
> If requested, I can develop the interface further.

I still think the benefit of ordering by tag would be marginal at
best, and what have you guys measured there? Under the current
framework, there's no easy way to measure full ordered-by-tag
implementation. The mechanism for filesystems to communicate the
ordering information (which would be a partially ordered graph) just
isn't there and there is no way the current usage of ordering-by-tag
only for barrier sequence can achieve anything close to that level of
difference.

Ripping out the original ordering by tag mechanism doesn't amount to
much. The use of ordering-by-tag was pretty half-assed there anyway.
If you think exporting full ordering information from filesystem to
the lower layers is worthwhile, please go ahead. It would be very
interesting to see how much actual difference it can make compared to
ordering-by-filesystem, and if it's actually better and the added
complexity is manageable, there's no reason not to do that.

Thank you.

--
tejun

2010-08-13 13:51:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello, Christoph.

On 08/13/2010 01:48 PM, Christoph Hellwig wrote:
> The patchset looks functionally correct to me, and with a small patch
> to make use of WRITE_FUA_FLUSH survives xfstests, and instrumenting the
> underlying qemu shows that we actually get the flush requests where we should.

Great.

> No performance or power fail testing done yet.
>
> But I do not like the transition very much. The new WRITE_FUA_FLUSH
> request is exactly what filesystems expect from a current barrier
> request, so I'd rather move to that functionality without breaking stuff
> inbetween.
>
> So if it were up to me I'd keep patches 1, 2, 4 and 5 from your series, then
> a main one to relax barrier semantics, then have the renaming patches 7
> and 8, and possibly keep patch 11 separate from the main implementation
> change, and if absolutely necessary also a separate one to introduce REQ_FUA
> and REQ_FLUSH in the bio interface, but keep things working while doing
> this.

There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
and just deprecate it. One is to avoid breaking filesystems'
expectations underneath it. Please note that there are out-of-tree
filesystems too. I think it would be too dangerous to relax
REQ_HARDBARRIER.

Another is that pseudo block layer drivers (loop, virtio_blk,
md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
would be broken in obscure ways between REQ_HARDBARRIER semantics
change and updates to each of those drivers, so I don't really think
changing the semantics while the mechanism is online is a good idea.

> Then we can have patches to disable the reiserfs barrier "optimization" as
> the very first one, and DM/MD support which I'm currently working on
> as the last one, and we can start doing the heavy testing.

Oops, I've already converted loop, virtio_blk/lguest and am working on
md/dm right now too. I'm almost done with md and now doing dm. :-)
Maybe we should post them right now so that we don't waste too much
time trying to solve the same problems?

Thanks.

--
tejun

2010-08-13 14:38:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
> There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
> and just deprecate it. One is to avoid breaking filesystems'
> expectations underneath it. Please note that there are out-of-tree
> filesystems too. I think it would be too dangerous to relax
> REQ_HARDBARRIER.

Note that the renaming patch would include a move from REQ_HARDBARRIER
to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
compile. And while out-of-tree filesystems do exist, it's their
problem to keep up with kernel changes. They decided not to be part
of the Linux kernel, so it'll be their job to keep up with it.

> Another is that pseudo block layer drivers (loop, virtio_blk,
> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
> would be broken in obscure ways between REQ_HARDBARRIER semantics
> change and updates to each of those drivers, so I don't really think
> changing the semantics while the mechanism is online is a good idea.

I don't think doing those changes in a separate commit is a good idea.

> > Then we can have patches to disable the reiserfs barrier "optimization" as
> > the very first one, and DM/MD support which I'm currently working on
> > as the last one, and we can start doing the heavy testing.
>
> Oops, I've already converted loop, virtio_blk/lguest and am working on
> md/dm right now too. I'm almost done with md and now doing dm. :-)
> Maybe we should post them right now so that we don't waste too much
> time trying to solve the same problems?

Here's the dm patch. It only handles normal bio-based dm so far, which
I understand and can test. Request-based dm (multipath) still needs
work.


Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c 2010-08-13 16:11:04.207010218 +0200
+++ linux-2.6/drivers/md/dm-crypt.c 2010-08-13 16:11:10.048003862 +0200
@@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t
struct dm_crypt_io *io;
struct crypt_config *cc;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
cc = ti->private;
bio->bi_bdev = cc->dev->bdev;
return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm-io.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-io.c 2010-08-13 16:11:04.213011894 +0200
+++ linux-2.6/drivers/md/dm-io.c 2010-08-13 16:11:10.049003792 +0200
@@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}

@@ -412,8 +412,8 @@ retry:
}
set_current_state(TASK_RUNNING);

- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
+ if (io->eopnotsupp_bits && (rw & REQ_FLUSH)) {
+ rw &= ~REQ_FLUSH;
goto retry;
}

Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c 2010-08-13 16:11:04.220013431 +0200
+++ linux-2.6/drivers/md/dm-raid1.c 2010-08-13 16:11:10.054018319 +0200
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
bio_list_init(&requeue);

while ((bio = bio_list_pop(writes))) {
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
bio_list_add(&sync, bio);
continue;
}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (likely(!bio_empty_barrier(bio)))
+ if (!bio_empty_flush(bio))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
Index: linux-2.6/drivers/md/dm-region-hash.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-region-hash.c 2010-08-13 16:11:04.228004631 +0200
+++ linux-2.6/drivers/md/dm-region-hash.c 2010-08-13 16:11:10.060003932 +0200
@@ -399,7 +399,7 @@ void dm_rh_mark_nosync(struct dm_region_
region_t region = dm_rh_bio_to_region(rh, bio);
int recovering = 0;

- if (bio_empty_barrier(bio)) {
+ if (bio_empty_flush(bio)) {
rh->barrier_failure = 1;
return;
}
@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_
struct bio *bio;

for (bio = bios->head; bio; bio = bio->bi_next) {
- if (bio_empty_barrier(bio))
+ if (bio_empty_flush(bio))
continue;
rh_inc(rh, dm_rh_bio_to_region(rh, bio));
}
Index: linux-2.6/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-snap.c 2010-08-13 16:11:04.238004701 +0200
+++ linux-2.6/drivers/md/dm-snap.c 2010-08-13 16:11:10.067005677 +0200
@@ -1581,7 +1581,7 @@ static int snapshot_map(struct dm_target
chunk_t chunk;
struct dm_snap_pending_exception *pe = NULL;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
bio->bi_bdev = s->cow->bdev;
return DM_MAPIO_REMAPPED;
}
@@ -1685,7 +1685,7 @@ static int snapshot_merge_map(struct dm_
int r = DM_MAPIO_REMAPPED;
chunk_t chunk;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
if (!map_context->flush_request)
bio->bi_bdev = s->origin->bdev;
else
@@ -2123,7 +2123,7 @@ static int origin_map(struct dm_target *
struct dm_dev *dev = ti->private;
bio->bi_bdev = dev->bdev;

- if (unlikely(bio_empty_barrier(bio)))
+ if (bio_empty_flush(bio))
return DM_MAPIO_REMAPPED;

/* Only tell snapshots if this is a write */
Index: linux-2.6/drivers/md/dm-stripe.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-stripe.c 2010-08-13 16:11:04.247011266 +0200
+++ linux-2.6/drivers/md/dm-stripe.c 2010-08-13 16:11:10.072026629 +0200
@@ -214,7 +214,7 @@ static int stripe_map(struct dm_target *
sector_t offset, chunk;
uint32_t stripe;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
BUG_ON(map_context->flush_request >= sc->stripes);
bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev;
return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c 2010-08-13 16:11:04.256004631 +0200
+++ linux-2.6/drivers/md/dm.c 2010-08-13 16:11:37.152005462 +0200
@@ -139,17 +139,6 @@ struct mapped_device {
spinlock_t deferred_lock;

/*
- * An error from the barrier request currently being processed.
- */
- int barrier_error;
-
- /*
- * Protect barrier_error from concurrent endio processing
- * in request-based dm.
- */
- spinlock_t barrier_error_lock;
-
- /*
* Processing queue (flush/barriers)
*/
struct workqueue_struct *wq;
@@ -194,9 +183,6 @@ struct mapped_device {

/* sysfs handle */
struct kobject kobj;
-
- /* zero-length barrier that will be cloned and submitted to targets */
- struct bio barrier_bio;
};

/*
@@ -505,10 +491,6 @@ static void end_io_acct(struct dm_io *io
part_stat_add(cpu, &dm_disk(md)->part0, ticks[rw], duration);
part_stat_unlock();

- /*
- * After this is decremented the bio must not be touched if it is
- * a barrier.
- */
dm_disk(md)->part0.in_flight[rw] = pending =
atomic_dec_return(&md->pending[rw]);
pending += atomic_read(&md->pending[rw^0x1]);
@@ -621,7 +603,7 @@ static void dec_pending(struct dm_io *io
*/
spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+ if (!(io->bio->bi_rw & (REQ_FLUSH|REQ_FUA)))
bio_list_add_head(&md->deferred,
io->bio);
} else
@@ -633,25 +615,13 @@ static void dec_pending(struct dm_io *io
io_error = io->error;
bio = io->bio;

- if (bio->bi_rw & REQ_HARDBARRIER) {
- /*
- * There can be just one barrier request so we use
- * a per-device variable for error reporting.
- * Note that you can't touch the bio after end_io_acct
- */
- if (!md->barrier_error && io_error != -EOPNOTSUPP)
- md->barrier_error = io_error;
- end_io_acct(io);
- free_io(md, io);
- } else {
- end_io_acct(io);
- free_io(md, io);
+ end_io_acct(io);
+ free_io(md, io);

- if (io_error != DM_ENDIO_REQUEUE) {
- trace_block_bio_complete(md->queue, bio);
+ if (io_error != DM_ENDIO_REQUEUE) {
+ trace_block_bio_complete(md->queue, bio);

- bio_endio(bio, io_error);
- }
+ bio_endio(bio, io_error);
}
}
}
@@ -744,23 +714,6 @@ static void end_clone_bio(struct bio *cl
blk_update_request(tio->orig, 0, nr_bytes);
}

-static void store_barrier_error(struct mapped_device *md, int error)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&md->barrier_error_lock, flags);
- /*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
- */
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
/*
* Don't touch any member of the md after calling this function because
* the md may be freed in dm_put() at the end of this function.
@@ -798,13 +751,11 @@ static void free_rq_clone(struct request
static void dm_end_request(struct request *clone, int error)
{
int rw = rq_data_dir(clone);
- int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;

- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;

@@ -818,15 +769,8 @@ static void dm_end_request(struct reques
}

free_rq_clone(clone);
-
- if (unlikely(is_barrier)) {
- if (unlikely(error))
- store_barrier_error(md, error);
- run_queue = 0;
- } else
- blk_end_request_all(rq, error);
-
- rq_completed(md, rw, run_queue);
+ blk_end_request_all(rq, error);
+ rq_completed(md, rw, 1);
}

static void dm_unprep_request(struct request *rq)
@@ -1113,7 +1057,7 @@ static struct bio *split_bvec(struct bio

clone->bi_sector = sector;
clone->bi_bdev = bio->bi_bdev;
- clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+ clone->bi_rw = bio->bi_rw;
clone->bi_vcnt = 1;
clone->bi_size = to_bytes(len);
clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1084,6 @@ static struct bio *clone_bio(struct bio

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_rw &= ~REQ_HARDBARRIER;
clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
@@ -1186,7 +1129,7 @@ static void __flush_target(struct clone_
__map_bio(ti, clone, tio);
}

-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_empty_flush(struct clone_info *ci)
{
unsigned target_nr = 0, flush_nr;
struct dm_target *ti;
@@ -1208,8 +1151,8 @@ static int __clone_and_map(struct clone_
sector_t len = 0, max;
struct dm_target_io *tio;

- if (unlikely(bio_empty_barrier(bio)))
- return __clone_and_map_empty_barrier(ci);
+ if (bio_empty_flush(bio))
+ return __clone_and_map_empty_flush(ci);

ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti))
@@ -1308,11 +1251,7 @@ static void __split_and_process_bio(stru

ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_HARDBARRIER))
- bio_io_error(bio);
- else
- if (!md->barrier_error)
- md->barrier_error = -EIO;
+ bio_io_error(bio);
return;
}

@@ -1326,7 +1265,7 @@ static void __split_and_process_bio(stru
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
ci.sector_count = bio_sectors(bio);
- if (unlikely(bio_empty_barrier(bio)))
+ if (bio_empty_flush(bio))
ci.sector_count = 1;
ci.idx = bio->bi_idx;

@@ -1420,8 +1359,7 @@ static int _dm_request(struct request_qu
* If we're suspended or the thread is processing barriers
* we have to queue this io for later.
*/
- if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+ if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))) {
up_read(&md->io_lock);

if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1873,7 +1811,6 @@ static struct mapped_device *alloc_dev(i
init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -2233,38 +2170,6 @@ static int dm_wait_for_completion(struct
return r;
}

-static void dm_flush(struct mapped_device *md)
-{
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
- bio_init(&md->barrier_bio);
- md->barrier_bio.bi_bdev = md->bdev;
- md->barrier_bio.bi_rw = WRITE_BARRIER;
- __split_and_process_bio(md, &md->barrier_bio);
-
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
- md->barrier_error = 0;
-
- dm_flush(md);
-
- if (!bio_empty_barrier(bio)) {
- __split_and_process_bio(md, bio);
- dm_flush(md);
- }
-
- if (md->barrier_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->barrier_error);
- else {
- spin_lock_irq(&md->deferred_lock);
- bio_list_add_head(&md->deferred, bio);
- spin_unlock_irq(&md->deferred_lock);
- }
-}
-
/*
* Process the deferred bios
*/
@@ -2290,12 +2195,8 @@ static void dm_wq_work(struct work_struc

if (dm_request_based(md))
generic_make_request(c);
- else {
- if (c->bi_rw & REQ_HARDBARRIER)
- process_barrier(md, c);
- else
- __split_and_process_bio(md, c);
- }
+ else
+ __split_and_process_bio(md, c);

down_write(&md->io_lock);
}
@@ -2326,8 +2227,6 @@ static int dm_rq_barrier(struct mapped_d
struct dm_target *ti;
struct request *clone;

- md->barrier_error = 0;
-
for (i = 0; i < num_targets; i++) {
ti = dm_table_get_target(map, i);
for (j = 0; j < ti->num_flush_requests; j++) {
@@ -2341,7 +2240,7 @@ static int dm_rq_barrier(struct mapped_d
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
dm_table_put(map);

- return md->barrier_error;
+ return 0;
}

static void dm_rq_barrier_work(struct work_struct *work)
Index: linux-2.6/include/linux/bio.h
===================================================================
--- linux-2.6.orig/include/linux/bio.h 2010-08-13 16:11:04.268004351 +0200
+++ linux-2.6/include/linux/bio.h 2010-08-13 16:11:10.082005677 +0200
@@ -66,8 +66,8 @@
#define bio_offset(bio) bio_iovec((bio))->bv_offset
#define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_idx)
#define bio_sectors(bio) ((bio)->bi_size >> 9)
-#define bio_empty_barrier(bio) \
- ((bio->bi_rw & REQ_HARDBARRIER) && \
+#define bio_empty_flush(bio) \
+ ((bio->bi_rw & REQ_FLUSH) && \
!bio_has_data(bio) && \
!(bio->bi_rw & REQ_DISCARD))

2010-08-13 14:54:01

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/13/2010 04:38 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
>> There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
>> and just deprecate it. One is to avoid breaking filesystems'
>> expectations underneath it. Please note that there are out-of-tree
>> filesystems too. I think it would be too dangerous to relax
>> REQ_HARDBARRIER.
>
> Note that the renaming patch would include a move from REQ_HARDBARRIER
> to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
> compile. And while out-of-tree filesystems do exist, it's their
> problem to keep up with kernel changes. They decided not to be part
> of the Linux kernel, so it'll be their job to keep up with it.

Oh, right, we can simply remove REQ_HARDBARRIER completely.

>> Another is that pseudo block layer drivers (loop, virtio_blk,
>> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
>> would be broken in obscure ways between REQ_HARDBARRIER semantics
>> change and updates to each of those drivers, so I don't really think
>> changing the semantics while the mechanism is online is a good idea.
>
> I don't think doing those changes in a separate commit is a good idea.

Do you want to change the whole thing in a single commit? That would
be a pretty big, invasive patch touching multiple subsystems. Also, I
don't know what to do about drbd and would like to leave its
conversion to the maintainer (in separate patches).

Eh, well, this is mostly logistics. Jens, what do you think?

>>> Then we can have patches to disable the reiserfs barrier "optimization" as
>>> the very first one, and DM/MD support which I'm currently working on
>>> as the last one, and we can start doing the heavy testing.
>>
>> Oops, I've already converted loop, virtio_blk/lguest and am working on
>> md/dm right now too. I'm almost done with md and now doing dm. :-)
>> Maybe we should post them right now so that we don't waste too much
>> time trying to solve the same problems?
>
> Here's the dm patch. It only handles normal bio-based dm so far, which
> I understand and can test. Request-based dm (multipath) still needs
> work.

Here's the combined patch I've been working on. I've verified loop
and virtio_blk/lguest. I just (like five mins ago) got the md/dm
conversions compiling, so I'm sure they're broken. The neat part is
that thanks to the separation between REQ_FLUSH and FUA handling,
bio-mangling drivers only have to sequence the pre-flush and pass FUA
directly to the lower layers, which in many cases saves an array-wide
cache flush cycle.
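
As a rough illustration (not code from the patch; all names are stand-ins),
this is roughly all a bio-mangling driver has to do per write once FLUSH
and FUA are decoupled: sequence the preflush itself and pass the data bio
down with the FUA bit intact, instead of issuing a second, array-wide
flush afterwards:

/* Illustrative stand-ins only -- not code from the patch. */
#include <stdio.h>

#define X_FLUSH (1u << 0)               /* stands in for REQ_FLUSH */
#define X_FUA   (1u << 1)               /* stands in for REQ_FUA   */

struct x_bio { unsigned int rw; };

static void preflush_members(void)      /* flush every member device's cache */
{
        printf("preflush all member devices\n");
}

static void submit_to_member(const struct x_bio *bio)
{
        printf("write data to target member%s\n",
               bio->rw & X_FUA ? " with FUA" : "");
}

static void handle_write(const struct x_bio *bio)
{
        /* only the preflush has to be sequenced by the mangling driver */
        if (bio->rw & X_FLUSH)
                preflush_members();

        /* FUA is passed straight down: the member holding the data commits
         * it to media itself, so no array-wide post-flush is required */
        submit_to_member(bio);
}

int main(void)
{
        struct x_bio commit = { .rw = X_FLUSH | X_FUA };

        handle_write(&commit);
        return 0;
}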

After getting this patch working, the only remaining bits would be
blktrace and drbd.

Thanks.

Documentation/lguest/lguest.c | 36 +++-----
drivers/block/loop.c | 18 ++--
drivers/block/virtio_blk.c | 26 ++---
drivers/md/dm-io.c | 20 ----
drivers/md/dm-log.c | 2
drivers/md/dm-raid1.c | 8 -
drivers/md/dm-snap-persistent.c | 2
drivers/md/dm.c | 176 +++++++++++++++++++--------------------
drivers/md/linear.c | 4
drivers/md/md.c | 117 +++++---------------------
drivers/md/md.h | 23 +----
drivers/md/multipath.c | 4
drivers/md/raid0.c | 4
drivers/md/raid1.c | 178 +++++++++++++---------------------------
drivers/md/raid1.h | 2
drivers/md/raid10.c | 6 -
drivers/md/raid5.c | 18 +---
include/linux/virtio_blk.h | 6 +
18 files changed, 244 insertions(+), 406 deletions(-)

Index: block/drivers/block/loop.c
===================================================================
--- block.orig/drivers/block/loop.c
+++ block/drivers/block/loop.c
@@ -477,17 +477,17 @@ static int do_bio_filebacked(struct loop
pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;

if (bio_rw(bio) == WRITE) {
- bool barrier = (bio->bi_rw & REQ_HARDBARRIER);
struct file *file = lo->lo_backing_file;

- if (barrier) {
- if (unlikely(!file->f_op->fsync)) {
- ret = -EOPNOTSUPP;
- goto out;
- }
+ /* REQ_HARDBARRIER is deprecated */
+ if (bio->bi_rw & REQ_HARDBARRIER) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }

+ if (bio->bi_rw & REQ_FLUSH) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret)) {
+ if (unlikely(ret && ret != -EINVAL)) {
ret = -EIO;
goto out;
}
@@ -495,9 +495,9 @@ static int do_bio_filebacked(struct loop

ret = lo_send(lo, bio, pos);

- if (barrier && !ret) {
+ if ((bio->bi_rw & REQ_FUA) && !ret) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret))
+ if (unlikely(ret && ret != -EINVAL))
ret = -EIO;
}
} else
Index: block/drivers/block/virtio_blk.c
===================================================================
--- block.orig/drivers/block/virtio_blk.c
+++ block/drivers/block/virtio_blk.c
@@ -128,9 +128,6 @@ static bool do_req(struct request_queue
}
}

- if (vbr->req->cmd_flags & REQ_HARDBARRIER)
- vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-
sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));

/*
@@ -157,6 +154,8 @@ static bool do_req(struct request_queue
if (rq_data_dir(vbr->req) == WRITE) {
vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
out += num;
+ if (req->cmd_flags & REQ_FUA)
+ vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
} else {
vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
in += num;
@@ -307,6 +306,7 @@ static int __devinit virtblk_probe(struc
{
struct virtio_blk *vblk;
struct request_queue *q;
+ unsigned int flush;
int err;
u64 cap;
u32 v, blk_size, sg_elems, opt_io_size;
@@ -388,15 +388,13 @@ static int __devinit virtblk_probe(struc
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that to
- * implement write barrier support; otherwise, we must assume
- * that the host does not perform any kind of volatile write
- * caching.
- */
+ /* configure queue flush support */
+ flush = 0;
if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
- blk_queue_flush(q, REQ_FLUSH);
+ flush |= REQ_FLUSH;
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FUA))
+ flush |= REQ_FUA;
+ blk_queue_flush(q, flush);

/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
@@ -515,9 +513,9 @@ static const struct virtio_device_id id_
};

static unsigned int features[] = {
- VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
- VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
- VIRTIO_BLK_F_SCSI, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
+ VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+ VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
+ VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_FUA,
};

/*
Index: block/include/linux/virtio_blk.h
===================================================================
--- block.orig/include/linux/virtio_blk.h
+++ block/include/linux/virtio_blk.h
@@ -16,6 +16,7 @@
#define VIRTIO_BLK_F_SCSI 7 /* Supports scsi command passthru */
#define VIRTIO_BLK_F_FLUSH 9 /* Cache flush command support */
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
+#define VIRTIO_BLK_F_FUA 11 /* Forced Unit Access write support */

#define VIRTIO_BLK_ID_BYTES 20 /* ID string length */

@@ -70,7 +71,10 @@ struct virtio_blk_config {
#define VIRTIO_BLK_T_FLUSH 4

/* Get device ID command */
-#define VIRTIO_BLK_T_GET_ID 8
+#define VIRTIO_BLK_T_GET_ID 8
+
+/* FUA command */
+#define VIRTIO_BLK_T_FUA 16

/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
Index: block/Documentation/lguest/lguest.c
===================================================================
--- block.orig/Documentation/lguest/lguest.c
+++ block/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue
off = out->sector * 512;

/*
- * The block device implements "barriers", where the Guest indicates
- * that it wants all previous writes to occur before this write. We
- * don't have a way of asking our kernel to do a barrier, so we just
- * synchronize all the data in the file. Pretty poor, no?
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
- /*
* In general the virtio block driver is allowed to try SCSI commands.
* It'd be nice if we supported eject, for example, but we don't.
*/
@@ -1679,6 +1670,19 @@ static void blk_request(struct virtqueue
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, ret);
}
+
+ /* Honor FUA by syncing everything. */
+ if (ret >= 0 && (out->type & VIRTIO_BLK_T_FUA)) {
+ ret = fdatasync(vblk->fd);
+ verbose("FUA fdatasync: %i\n", ret);
+ }
+
+ wlen = sizeof(*in);
+ *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ } else if (out->type & VIRTIO_BLK_T_FLUSH) {
+ /* Flush */
+ ret = fdatasync(vblk->fd);
+ verbose("FLUSH fdatasync: %i\n", ret);
wlen = sizeof(*in);
*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
} else {
@@ -1702,15 +1706,6 @@ static void blk_request(struct virtqueue
}
}

- /*
- * OK, so we noted that it was pretty poor to use an fdatasync as a
- * barrier. But Christoph Hellwig points out that we need a sync
- * *afterwards* as well: "Barriers specify no reordering to the front
- * or the back." And Jens Axboe confirmed it, so here we are:
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
/* Finished that request. */
add_used(vq, head, wlen);
}
@@ -1735,8 +1730,9 @@ static void setup_block_file(const char
vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
vblk->len = lseek64(vblk->fd, 0, SEEK_END);

- /* We support barriers. */
- add_feature(dev, VIRTIO_BLK_F_BARRIER);
+ /* We support FLUSH and FUA. */
+ add_feature(dev, VIRTIO_BLK_F_FLUSH);
+ add_feature(dev, VIRTIO_BLK_F_FUA);

/* Tell Guest how many sectors this device has. */
conf.capacity = cpu_to_le64(vblk->len / 512);
Index: block/drivers/md/linear.c
===================================================================
--- block.orig/drivers/md/linear.c
+++ block/drivers/md/linear.c
@@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
dev_info_t *tmp_dev;
sector_t start_sector;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/md.c
===================================================================
--- block.orig/drivers/md/md.c
+++ block/drivers/md/md.c
@@ -226,12 +226,12 @@ static int md_make_request(struct reques
return 0;
}
rcu_read_lock();
- if (mddev->suspended || mddev->barrier) {
+ if (mddev->suspended) {
DEFINE_WAIT(__wait);
for (;;) {
prepare_to_wait(&mddev->sb_wait, &__wait,
TASK_UNINTERRUPTIBLE);
- if (!mddev->suspended && !mddev->barrier)
+ if (!mddev->suspended)
break;
rcu_read_unlock();
schedule();
@@ -280,40 +280,29 @@ static void mddev_resume(mddev_t *mddev)

int mddev_congested(mddev_t *mddev, int bits)
{
- if (mddev->barrier)
- return 1;
return mddev->suspended;
}
EXPORT_SYMBOL(mddev_congested);

/*
- * Generic barrier handling for md
+ * Generic flush handling for md
*/

-#define POST_REQUEST_BARRIER ((void*)1)
-
-static void md_end_barrier(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err)
{
mdk_rdev_t *rdev = bio->bi_private;
mddev_t *mddev = rdev->mddev;
- if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
- set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);

rdev_dec_pending(rdev, mddev);

if (atomic_dec_and_test(&mddev->flush_pending)) {
- if (mddev->barrier == POST_REQUEST_BARRIER) {
- /* This was a post-request barrier */
- mddev->barrier = NULL;
- wake_up(&mddev->sb_wait);
- } else
- /* The pre-request barrier has finished */
- schedule_work(&mddev->barrier_work);
+ /* The pre-request flush has finished */
+ schedule_work(&mddev->flush_work);
}
bio_put(bio);
}

-static void submit_barriers(mddev_t *mddev)
+static void submit_flushes(mddev_t *mddev)
{
mdk_rdev_t *rdev;

@@ -330,60 +319,56 @@ static void submit_barriers(mddev_t *mdd
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
bi = bio_alloc(GFP_KERNEL, 0);
- bi->bi_end_io = md_end_barrier;
+ bi->bi_end_io = md_end_flush;
bi->bi_private = rdev;
bi->bi_bdev = rdev->bdev;
atomic_inc(&mddev->flush_pending);
- submit_bio(WRITE_BARRIER, bi);
+ submit_bio(WRITE_FLUSH, bi);
rcu_read_lock();
rdev_dec_pending(rdev, mddev);
}
rcu_read_unlock();
}

-static void md_submit_barrier(struct work_struct *ws)
+static void md_submit_flush_data(struct work_struct *ws)
{
- mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
- struct bio *bio = mddev->barrier;
+ mddev_t *mddev = container_of(ws, mddev_t, flush_work);
+ struct bio *bio = mddev->flush_bio;

atomic_set(&mddev->flush_pending, 1);

- if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
- bio_endio(bio, -EOPNOTSUPP);
- else if (bio->bi_size == 0)
+ if (bio->bi_size == 0)
/* an empty barrier - all done */
bio_endio(bio, 0);
else {
- bio->bi_rw &= ~REQ_HARDBARRIER;
+ bio->bi_rw &= ~REQ_FLUSH;
if (mddev->pers->make_request(mddev, bio))
generic_make_request(bio);
- mddev->barrier = POST_REQUEST_BARRIER;
- submit_barriers(mddev);
}
if (atomic_dec_and_test(&mddev->flush_pending)) {
- mddev->barrier = NULL;
+ mddev->flush_bio = NULL;
wake_up(&mddev->sb_wait);
}
}

-void md_barrier_request(mddev_t *mddev, struct bio *bio)
+void md_flush_request(mddev_t *mddev, struct bio *bio)
{
spin_lock_irq(&mddev->write_lock);
wait_event_lock_irq(mddev->sb_wait,
- !mddev->barrier,
+ !mddev->flush_bio,
mddev->write_lock, /*nothing*/);
- mddev->barrier = bio;
+ mddev->flush_bio = bio;
spin_unlock_irq(&mddev->write_lock);

atomic_set(&mddev->flush_pending, 1);
- INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+ INIT_WORK(&mddev->flush_work, md_submit_flush_data);

- submit_barriers(mddev);
+ submit_flushes(mddev);

if (atomic_dec_and_test(&mddev->flush_pending))
- schedule_work(&mddev->barrier_work);
+ schedule_work(&mddev->flush_work);
}
-EXPORT_SYMBOL(md_barrier_request);
+EXPORT_SYMBOL(md_flush_request);

static inline mddev_t *mddev_get(mddev_t *mddev)
{
@@ -642,31 +627,6 @@ static void super_written(struct bio *bi
bio_put(bio);
}

-static void super_written_barrier(struct bio *bio, int error)
-{
- struct bio *bio2 = bio->bi_private;
- mdk_rdev_t *rdev = bio2->bi_private;
- mddev_t *mddev = rdev->mddev;
-
- if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
- error == -EOPNOTSUPP) {
- unsigned long flags;
- /* barriers don't appear to be supported :-( */
- set_bit(BarriersNotsupp, &rdev->flags);
- mddev->barriers_work = 0;
- spin_lock_irqsave(&mddev->write_lock, flags);
- bio2->bi_next = mddev->biolist;
- mddev->biolist = bio2;
- spin_unlock_irqrestore(&mddev->write_lock, flags);
- wake_up(&mddev->sb_wait);
- bio_put(bio);
- } else {
- bio_put(bio2);
- bio->bi_private = rdev;
- super_written(bio, error);
- }
-}
-
void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page)
{
@@ -675,51 +635,28 @@ void md_super_write(mddev_t *mddev, mdk_
* and decrement it on completion, waking up sb_wait
* if zero is reached.
* If an error occurred, call md_error
- *
- * As we might need to resubmit the request if REQ_HARDBARRIER
- * causes ENOTSUPP, we allocate a spare bio...
*/
struct bio *bio = bio_alloc(GFP_NOIO, 1);
- int rw = REQ_WRITE | REQ_SYNC | REQ_UNPLUG;

bio->bi_bdev = rdev->bdev;
bio->bi_sector = sector;
bio_add_page(bio, page, size, 0);
bio->bi_private = rdev;
bio->bi_end_io = super_written;
- bio->bi_rw = rw;

atomic_inc(&mddev->pending_writes);
- if (!test_bit(BarriersNotsupp, &rdev->flags)) {
- struct bio *rbio;
- rw |= REQ_HARDBARRIER;
- rbio = bio_clone(bio, GFP_NOIO);
- rbio->bi_private = bio;
- rbio->bi_end_io = super_written_barrier;
- submit_bio(rw, rbio);
- } else
- submit_bio(rw, bio);
+ submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
+ bio);
}

void md_super_wait(mddev_t *mddev)
{
- /* wait for all superblock writes that were scheduled to complete.
- * if any had to be retried (due to BARRIER problems), retry them
- */
+ /* wait for all superblock writes that were scheduled to complete */
DEFINE_WAIT(wq);
for(;;) {
prepare_to_wait(&mddev->sb_wait, &wq, TASK_UNINTERRUPTIBLE);
if (atomic_read(&mddev->pending_writes)==0)
break;
- while (mddev->biolist) {
- struct bio *bio;
- spin_lock_irq(&mddev->write_lock);
- bio = mddev->biolist;
- mddev->biolist = bio->bi_next ;
- bio->bi_next = NULL;
- spin_unlock_irq(&mddev->write_lock);
- submit_bio(bio->bi_rw, bio);
- }
schedule();
}
finish_wait(&mddev->sb_wait, &wq);
@@ -1016,7 +953,6 @@ static int super_90_validate(mddev_t *md
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 0;
@@ -1431,7 +1367,6 @@ static int super_1_validate(mddev_t *mdd
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 1;
@@ -4463,7 +4398,6 @@ static int md_run(mddev_t *mddev)
/* may be over-ridden by personality */
mddev->resync_max_sectors = mddev->dev_sectors;

- mddev->barriers_work = 1;
mddev->ok_start_degraded = start_dirty_degraded;

if (start_readonly && mddev->ro == 0)
@@ -4638,7 +4572,6 @@ static void md_clean(mddev_t *mddev)
mddev->recovery = 0;
mddev->in_sync = 0;
mddev->degraded = 0;
- mddev->barriers_work = 0;
mddev->safemode = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
Index: block/drivers/md/md.h
===================================================================
--- block.orig/drivers/md/md.h
+++ block/drivers/md/md.h
@@ -67,7 +67,6 @@ struct mdk_rdev_s
#define Faulty 1 /* device is known to have a fault */
#define In_sync 2 /* device is in_sync with rest of array */
#define WriteMostly 4 /* Avoid reading if at all possible */
-#define BarriersNotsupp 5 /* REQ_HARDBARRIER is not supported */
#define AllReserved 6 /* If whole device is reserved for
* one array */
#define AutoDetected 7 /* added by auto-detect */
@@ -249,13 +248,6 @@ struct mddev_s
int degraded; /* whether md should consider
* adding a spare
*/
- int barriers_work; /* initialised to true, cleared as soon
- * as a barrier request to slave
- * fails. Only supported
- */
- struct bio *biolist; /* bios that need to be retried
- * because REQ_HARDBARRIER is not supported
- */

atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
@@ -308,16 +300,13 @@ struct mddev_s
struct list_head all_mddevs;

struct attribute_group *to_remove;
- /* Generic barrier handling.
- * If there is a pending barrier request, all other
- * writes are blocked while the devices are flushed.
- * The last to finish a flush schedules a worker to
- * submit the barrier request (without the barrier flag),
- * then submit more flush requests.
+ /* Generic flush handling.
+ * The last to finish preflush schedules a worker to submit
+ * the rest of the request (without the REQ_FLUSH flag).
*/
- struct bio *barrier;
+ struct bio *flush_bio;
atomic_t flush_pending;
- struct work_struct barrier_work;
+ struct work_struct flush_work;
};


@@ -458,7 +447,7 @@ extern void md_done_sync(mddev_t *mddev,
extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);

extern int mddev_congested(mddev_t *mddev, int bits);
-extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
+extern void md_flush_request(mddev_t *mddev, struct bio *bio);
extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page);
extern void md_super_wait(mddev_t *mddev);
Index: block/drivers/md/raid0.c
===================================================================
--- block.orig/drivers/md/raid0.c
+++ block/drivers/md/raid0.c
@@ -483,8 +483,8 @@ static int raid0_make_request(mddev_t *m
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/raid1.c
===================================================================
--- block.orig/drivers/md/raid1.c
+++ block/drivers/md/raid1.c
@@ -319,83 +319,74 @@ static void raid1_end_write_request(stru
if (r1_bio->bios[mirror] == bio)
break;

- if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
- set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
- set_bit(R1BIO_BarrierRetry, &r1_bio->state);
- r1_bio->mddev->barriers_work = 0;
- /* Don't rdev_dec_pending in this branch - keep it for the retry */
- } else {
+ /*
+ * 'one mirror IO has finished' event handler:
+ */
+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
+ if (!uptodate) {
+ md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
+ /* an I/O failed, we can't clear the bitmap */
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ } else
/*
- * this branch is our 'one mirror IO has finished' event handler:
+ * Set R1BIO_Uptodate in our master bio, so that we
+ * will return a good error code for to the higher
+ * levels even if IO on some other mirrored buffer
+ * fails.
+ *
+ * The 'master' represents the composite IO operation
+ * to user-side. So if something waits for IO, then it
+ * will wait for the 'master' bio.
*/
- r1_bio->bios[mirror] = NULL;
- to_put = bio;
- if (!uptodate) {
- md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R1BIO_Degraded, &r1_bio->state);
- } else
- /*
- * Set R1BIO_Uptodate in our master bio, so that
- * we will return a good error code for to the higher
- * levels even if IO on some other mirrored buffer fails.
- *
- * The 'master' represents the composite IO operation to
- * user-side. So if something waits for IO, then it will
- * wait for the 'master' bio.
- */
- set_bit(R1BIO_Uptodate, &r1_bio->state);
+ set_bit(R1BIO_Uptodate, &r1_bio->state);
+
+ update_head_pos(mirror, r1_bio);

- update_head_pos(mirror, r1_bio);
+ if (behind) {
+ if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
+ atomic_dec(&r1_bio->behind_remaining);

- if (behind) {
- if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
- atomic_dec(&r1_bio->behind_remaining);
-
- /* In behind mode, we ACK the master bio once the I/O has safely
- * reached all non-writemostly disks. Setting the Returned bit
- * ensures that this gets done only once -- we don't ever want to
- * return -EIO here, instead we'll wait */
-
- if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
- test_bit(R1BIO_Uptodate, &r1_bio->state)) {
- /* Maybe we can return now */
- if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
- struct bio *mbio = r1_bio->master_bio;
- PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
- (unsigned long long) mbio->bi_sector,
- (unsigned long long) mbio->bi_sector +
- (mbio->bi_size >> 9) - 1);
- bio_endio(mbio, 0);
- }
+ /*
+ * In behind mode, we ACK the master bio once the I/O
+ * has safely reached all non-writemostly
+ * disks. Setting the Returned bit ensures that this
+ * gets done only once -- we don't ever want to return
+ * -EIO here, instead we'll wait
+ */
+ if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
+ test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+ /* Maybe we can return now */
+ if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+ struct bio *mbio = r1_bio->master_bio;
+ PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
+ (unsigned long long) mbio->bi_sector,
+ (unsigned long long) mbio->bi_sector +
+ (mbio->bi_size >> 9) - 1);
+ bio_endio(mbio, 0);
}
}
- rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
}
+ rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+
/*
- *
* Let's see if all mirrored write operations have finished
* already.
*/
if (atomic_dec_and_test(&r1_bio->remaining)) {
- if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
- reschedule_retry(r1_bio);
- else {
- /* it really is the end of this request */
- if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
- /* free extra copy of the data pages */
- int i = bio->bi_vcnt;
- while (i--)
- safe_put_page(bio->bi_io_vec[i].bv_page);
- }
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
- r1_bio->sectors,
- !test_bit(R1BIO_Degraded, &r1_bio->state),
- behind);
- md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
- }
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ /* free extra copy of the data pages */
+ int i = bio->bi_vcnt;
+ while (i--)
+ safe_put_page(bio->bi_io_vec[i].bv_page);
+ }
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ !test_bit(R1BIO_Degraded, &r1_bio->state),
+ behind);
+ md_write_end(r1_bio->mddev);
+ raid_end_bio_io(r1_bio);
}

if (to_put)
@@ -787,17 +778,14 @@ static int make_request(mddev_t *mddev,
struct bio_list bl;
struct page **behind_pages = NULL;
const int rw = bio_data_dir(bio);
- const bool do_sync = (bio->bi_rw & REQ_SYNC);
- bool do_barriers;
+ const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned int do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
mdk_rdev_t *blocked_rdev;

/*
* Register the new request and wait if the reconstruction
* thread has put up a bar for new requests.
* Continue immediately if no resync is active currently.
- * We test barriers_work *after* md_write_start as md_write_start
- * may cause the first superblock write, and that will check out
- * if barriers work.
*/

md_write_start(mddev, bio); /* wait on superblock update early */
@@ -821,13 +809,6 @@ static int make_request(mddev_t *mddev,
}
finish_wait(&conf->wait_barrier, &w);
}
- if (unlikely(!mddev->barriers_work &&
- (bio->bi_rw & REQ_HARDBARRIER))) {
- if (rw == WRITE)
- md_write_end(mddev);
- bio_endio(bio, -EOPNOTSUPP);
- return 0;
- }

wait_barrier(conf);

@@ -877,7 +858,7 @@ static int make_request(mddev_t *mddev,
read_bio->bi_sector = r1_bio->sector + mirror->rdev->data_offset;
read_bio->bi_bdev = mirror->rdev->bdev;
read_bio->bi_end_io = raid1_end_read_request;
- read_bio->bi_rw = READ | do_sync;
+ read_bio->bi_rw = READ | do_sync | do_flush_fua;
read_bio->bi_private = r1_bio;

generic_make_request(read_bio);
@@ -959,10 +940,6 @@ static int make_request(mddev_t *mddev,
atomic_set(&r1_bio->remaining, 0);
atomic_set(&r1_bio->behind_remaining, 0);

- do_barriers = bio->bi_rw & REQ_HARDBARRIER;
- if (do_barriers)
- set_bit(R1BIO_Barrier, &r1_bio->state);
-
bio_list_init(&bl);
for (i = 0; i < disks; i++) {
struct bio *mbio;
@@ -975,7 +952,7 @@ static int make_request(mddev_t *mddev,
mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
mbio->bi_end_io = raid1_end_write_request;
- mbio->bi_rw = WRITE | do_barriers | do_sync;
+ mbio->bi_rw = WRITE | do_sync;
mbio->bi_private = r1_bio;

if (behind_pages) {
@@ -1631,41 +1608,6 @@ static void raid1d(mddev_t *mddev)
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
sync_request_write(mddev, r1_bio);
unplug = 1;
- } else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
- /* some requests in the r1bio were REQ_HARDBARRIER
- * requests which failed with -EOPNOTSUPP. Hohumm..
- * Better resubmit without the barrier.
- * We know which devices to resubmit for, because
- * all others have had their bios[] entry cleared.
- * We already have a nr_pending reference on these rdevs.
- */
- int i;
- const bool do_sync = (r1_bio->master_bio->bi_rw & REQ_SYNC);
- clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
- clear_bit(R1BIO_Barrier, &r1_bio->state);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i])
- atomic_inc(&r1_bio->remaining);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i]) {
- struct bio_vec *bvec;
- int j;
-
- bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
- /* copy pages from the failed bio, as
- * this might be a write-behind device */
- __bio_for_each_segment(bvec, bio, j, 0)
- bvec->bv_page = bio_iovec_idx(r1_bio->bios[i], j)->bv_page;
- bio_put(r1_bio->bios[i]);
- bio->bi_sector = r1_bio->sector +
- conf->mirrors[i].rdev->data_offset;
- bio->bi_bdev = conf->mirrors[i].rdev->bdev;
- bio->bi_end_io = raid1_end_write_request;
- bio->bi_rw = WRITE | do_sync;
- bio->bi_private = r1_bio;
- r1_bio->bios[i] = bio;
- generic_make_request(bio);
- }
} else {
int disk;

Index: block/drivers/md/raid1.h
===================================================================
--- block.orig/drivers/md/raid1.h
+++ block/drivers/md/raid1.h
@@ -117,8 +117,6 @@ struct r1bio_s {
#define R1BIO_IsSync 1
#define R1BIO_Degraded 2
#define R1BIO_BehindIO 3
-#define R1BIO_Barrier 4
-#define R1BIO_BarrierRetry 5
/* For write-behind requests, we call bi_end_io when
* the last non-write-behind device completes, providing
* any write was successful. Otherwise we call when
Index: block/drivers/md/raid5.c
===================================================================
--- block.orig/drivers/md/raid5.c
+++ block/drivers/md/raid5.c
@@ -3278,7 +3278,7 @@ static void handle_stripe5(struct stripe

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3580,7 +3580,7 @@ static void handle_stripe6(struct stripe

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3958,14 +3958,8 @@ static int make_request(mddev_t *mddev,
const int rw = bio_data_dir(bi);
int remaining;

- if (unlikely(bi->bi_rw & REQ_HARDBARRIER)) {
- /* Drain all pending writes. We only really need
- * to ensure they have been submitted, but this is
- * easier.
- */
- mddev->pers->quiesce(mddev, 1);
- mddev->pers->quiesce(mddev, 0);
- md_barrier_request(mddev, bi);
+ if (unlikely(bi->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bi);
return 0;
}

@@ -4083,7 +4077,7 @@ static int make_request(mddev_t *mddev,
finish_wait(&conf->wait_for_overlap, &w);
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
- if (mddev->barrier &&
+ if (mddev->flush_bio &&
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh);
@@ -4106,7 +4100,7 @@ static int make_request(mddev_t *mddev,
bio_endio(bi, 0);
}

- if (mddev->barrier) {
+ if (mddev->flush_bio) {
/* We need to wait for the stripes to all be handled.
* So: wait for preread_active_stripes to drop to 0.
*/
Index: block/drivers/md/multipath.c
===================================================================
--- block.orig/drivers/md/multipath.c
+++ block/drivers/md/multipath.c
@@ -142,8 +142,8 @@ static int multipath_make_request(mddev_
struct multipath_bh * mp_bh;
struct multipath_info *multipath;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/raid10.c
===================================================================
--- block.orig/drivers/md/raid10.c
+++ block/drivers/md/raid10.c
@@ -799,13 +799,13 @@ static int make_request(mddev_t *mddev,
int i;
int chunk_sects = conf->chunk_mask + 1;
const int rw = bio_data_dir(bio);
- const bool do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
struct bio_list bl;
unsigned long flags;
mdk_rdev_t *blocked_rdev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/dm-io.c
===================================================================
--- block.orig/drivers/md/dm-io.c
+++ block/drivers/md/dm-io.c
@@ -31,7 +31,6 @@ struct dm_io_client {
*/
struct io {
unsigned long error_bits;
- unsigned long eopnotsupp_bits;
atomic_t count;
struct task_struct *sleeper;
struct dm_io_client *client;
@@ -130,11 +129,8 @@ static void retrieve_io_and_region_from_
*---------------------------------------------------------------*/
static void dec_count(struct io *io, unsigned int region, int error)
{
- if (error) {
+ if (error)
set_bit(region, &io->error_bits);
- if (error == -EOPNOTSUPP)
- set_bit(region, &io->eopnotsupp_bits);
- }

if (atomic_dec_and_test(&io->count)) {
if (io->sleeper)
@@ -310,8 +306,8 @@ static void do_region(int rw, unsigned r
sector_t remaining = where->count;

/*
- * where->count may be zero if rw holds a write barrier and we
- * need to send a zero-sized barrier.
+ * where->count may be zero if rw holds a flush and we need to
+ * send a zero-sized flush.
*/
do {
/*
@@ -364,7 +360,7 @@ static void dispatch_io(int rw, unsigned
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}

@@ -393,9 +389,7 @@ static int sync_io(struct dm_io_client *
return -EIO;
}

-retry:
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = current;
io->client = client;
@@ -412,11 +406,6 @@ retry:
}
set_current_state(TASK_RUNNING);

- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
- goto retry;
- }
-
if (error_bits)
*error_bits = io->error_bits;

@@ -437,7 +426,6 @@ static int async_io(struct dm_io_client

io = mempool_alloc(client->pool, GFP_NOIO);
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = NULL;
io->client = client;
Index: block/drivers/md/dm-raid1.c
===================================================================
--- block.orig/drivers/md/dm-raid1.c
+++ block/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@ static int mirror_flush(struct dm_target
struct dm_io_region io[ms->nr_mirrors];
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE_BARRIER,
+ .bi_rw = WRITE_FLUSH,
.mem.type = DM_IO_KMEM,
.mem.ptr.bvec = NULL,
.client = ms->io_client,
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *
struct dm_io_region io[ms->nr_mirrors], *dest = io;
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+ .bi_rw = WRITE | (bio->bi_rw & (WRITE_FLUSH | WRITE_FUA)),
.mem.type = DM_IO_BVEC,
.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
bio_list_init(&requeue);

while ((bio = bio_list_pop(writes))) {
- if (unlikely(bio_empty_barrier(bio))) {
+ if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
bio_list_add(&sync, bio);
continue;
}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (likely(!bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH) || bio_has_data(bio))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -139,21 +139,21 @@ struct mapped_device {
spinlock_t deferred_lock;

/*
- * An error from the barrier request currently being processed.
+ * An error from the flush request currently being processed.
*/
- int barrier_error;
+ int flush_error;

/*
- * Protect barrier_error from concurrent endio processing
+ * Protect flush_error from concurrent endio processing
* in request-based dm.
*/
- spinlock_t barrier_error_lock;
+ spinlock_t flush_error_lock;

/*
- * Processing queue (flush/barriers)
+ * Processing queue (flush)
*/
struct workqueue_struct *wq;
- struct work_struct barrier_work;
+ struct work_struct flush_work;

/* A pointer to the currently processing pre/post flush request */
struct request *flush_request;
@@ -195,8 +195,8 @@ struct mapped_device {
/* sysfs handle */
struct kobject kobj;

- /* zero-length barrier that will be cloned and submitted to targets */
- struct bio barrier_bio;
+ /* zero-length flush that will be cloned and submitted to targets */
+ struct bio flush_bio;
};

/*
@@ -507,7 +507,7 @@ static void end_io_acct(struct dm_io *io

/*
* After this is decremented the bio must not be touched if it is
- * a barrier.
+ * a flush.
*/
dm_disk(md)->part0.in_flight[rw] = pending =
atomic_dec_return(&md->pending[rw]);
@@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
*/
spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+ if (!(io->bio->bi_rw & REQ_FLUSH))
bio_list_add_head(&md->deferred,
io->bio);
} else
@@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
io_error = io->error;
bio = io->bio;

- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & REQ_FLUSH) {
/*
- * There can be just one barrier request so we use
+ * There can be just one flush request so we use
* a per-device variable for error reporting.
* Note that you can't touch the bio after end_io_acct
*/
- if (!md->barrier_error && io_error != -EOPNOTSUPP)
- md->barrier_error = io_error;
+ if (!md->flush_error)
+ md->flush_error = io_error;
end_io_acct(io);
free_io(md, io);
} else {
@@ -744,21 +744,18 @@ static void end_clone_bio(struct bio *cl
blk_update_request(tio->orig, 0, nr_bytes);
}

-static void store_barrier_error(struct mapped_device *md, int error)
+static void store_flush_error(struct mapped_device *md, int error)
{
unsigned long flags;

- spin_lock_irqsave(&md->barrier_error_lock, flags);
+ spin_lock_irqsave(&md->flush_error_lock, flags);
/*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
- */
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
+ * Basically, the first error is taken, but requeue request
+ * supersedes any I/O error.
+ */
+ if (!md->flush_error || error == DM_ENDIO_REQUEUE)
+ md->flush_error = error;
+ spin_unlock_irqrestore(&md->flush_error_lock, flags);
}

/*
@@ -799,12 +796,12 @@ static void dm_end_request(struct reques
{
int rw = rq_data_dir(clone);
int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
+ bool is_flush = clone->cmd_flags & REQ_FLUSH;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;

- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;

@@ -819,12 +816,13 @@ static void dm_end_request(struct reques

free_rq_clone(clone);

- if (unlikely(is_barrier)) {
+ if (!is_flush)
+ blk_end_request_all(rq, error);
+ else {
if (unlikely(error))
- store_barrier_error(md, error);
+ store_flush_error(md, error);
run_queue = 0;
- } else
- blk_end_request_all(rq, error);
+ }

rq_completed(md, rw, run_queue);
}
@@ -851,9 +849,9 @@ void dm_requeue_unmapped_request(struct
struct request_queue *q = rq->q;
unsigned long flags;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request.
+ * Flush clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
@@ -950,14 +948,14 @@ static void dm_complete_request(struct r
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request. So can't use
+ * Flush clones share an original request. So can't use
* softirq_done with the original.
* Pass the clone to dm_done() directly in this special case.
* It is safe (even if clone->q->queue_lock is held here)
* because there is no I/O dispatching during the completion
- * of barrier clone.
+ * of flush clone.
*/
dm_done(clone, error, true);
return;
@@ -979,9 +977,9 @@ void dm_kill_unmapped_request(struct req
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request.
+ * Flush clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
@@ -1098,7 +1096,7 @@ static void dm_bio_destructor(struct bio
}

/*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that is just a part of a bvec.
*/
static struct bio *split_bvec(struct bio *bio, sector_t sector,
unsigned short idx, unsigned int offset,
@@ -1113,7 +1111,7 @@ static struct bio *split_bvec(struct bio

clone->bi_sector = sector;
clone->bi_bdev = bio->bi_bdev;
- clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+ clone->bi_rw = bio->bi_rw;
clone->bi_vcnt = 1;
clone->bi_size = to_bytes(len);
clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1138,6 @@ static struct bio *clone_bio(struct bio

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_rw &= ~REQ_HARDBARRIER;
clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
@@ -1186,7 +1183,7 @@ static void __flush_target(struct clone_
__map_bio(ti, clone, tio);
}

-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
{
unsigned target_nr = 0, flush_nr;
struct dm_target *ti;
@@ -1208,9 +1205,6 @@ static int __clone_and_map(struct clone_
sector_t len = 0, max;
struct dm_target_io *tio;

- if (unlikely(bio_empty_barrier(bio)))
- return __clone_and_map_empty_barrier(ci);
-
ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti))
return -EIO;
@@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru

ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_HARDBARRIER))
+ if (!(bio->bi_rw & REQ_FLUSH))
bio_io_error(bio);
else
- if (!md->barrier_error)
- md->barrier_error = -EIO;
+ if (!md->flush_error)
+ md->flush_error = -EIO;
return;
}

@@ -1325,14 +1319,22 @@ static void __split_and_process_bio(stru
ci.io->md = md;
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
- ci.sector_count = bio_sectors(bio);
- if (unlikely(bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH))
+ ci.sector_count = bio_sectors(bio);
+ else {
+ /* FLUSH bio reaching here should all be empty */
+ WARN_ON_ONCE(bio_has_data(bio));
ci.sector_count = 1;
+ }
ci.idx = bio->bi_idx;

start_io_acct(ci.io);
- while (ci.sector_count && !error)
- error = __clone_and_map(&ci);
+ while (ci.sector_count && !error) {
+ if (!(bio->bi_rw & REQ_FLUSH))
+ error = __clone_and_map(&ci);
+ else
+ error = __clone_and_map_flush(&ci);
+ }

/* drop the extra reference count */
dec_pending(ci.io, error);
@@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
part_stat_unlock();

/*
- * If we're suspended or the thread is processing barriers
+ * If we're suspended or the thread is processing flushes
* we have to queue this io for later.
*/
if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+ (bio->bi_rw & REQ_FLUSH)) {
up_read(&md->io_lock);

if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1464,10 +1466,7 @@ static int dm_request(struct request_que

static bool dm_rq_is_flush_request(struct request *rq)
{
- if (rq->cmd_flags & REQ_FLUSH)
- return true;
- else
- return false;
+ return rq->cmd_flags & REQ_FLUSH;
}

void dm_dispatch_request(struct request *rq)
@@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
if (dm_rq_is_flush_request(rq)) {
blk_rq_init(NULL, clone);
clone->cmd_type = REQ_TYPE_FS;
- clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
+ clone->cmd_flags |= (REQ_FLUSH | WRITE);
} else {
r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
dm_rq_bio_constructor, tio);
@@ -1668,7 +1667,7 @@ static void dm_request_fn(struct request
BUG_ON(md->flush_request);
md->flush_request = rq;
blk_start_request(rq);
- queue_work(md->wq, &md->barrier_work);
+ queue_work(md->wq, &md->flush_work);
goto out;
}

@@ -1843,7 +1842,7 @@ out:
static const struct block_device_operations dm_blk_dops;

static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
+static void dm_rq_flush_work(struct work_struct *work);

/*
* Allocate and initialise a blank device with a given minor.
@@ -1873,7 +1872,7 @@ static struct mapped_device *alloc_dev(i
init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
+ spin_lock_init(&md->flush_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -1918,7 +1917,7 @@ static struct mapped_device *alloc_dev(i
atomic_set(&md->pending[1], 0);
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
- INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
+ INIT_WORK(&md->flush_work, dm_rq_flush_work);
init_waitqueue_head(&md->eventq);

md->disk->major = _major;
@@ -2233,31 +2232,28 @@ static int dm_wait_for_completion(struct
return r;
}

-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
{
+ md->flush_error = 0;
+
+ /* handle REQ_FLUSH */
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

- bio_init(&md->barrier_bio);
- md->barrier_bio.bi_bdev = md->bdev;
- md->barrier_bio.bi_rw = WRITE_BARRIER;
- __split_and_process_bio(md, &md->barrier_bio);
+ bio_init(&md->flush_bio);
+ md->flush_bio.bi_bdev = md->bdev;
+ md->flush_bio.bi_rw = WRITE_FLUSH;
+ __split_and_process_bio(md, &md->flush_bio);

dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
- md->barrier_error = 0;

- dm_flush(md);
+ bio->bi_rw &= ~REQ_FLUSH;

- if (!bio_empty_barrier(bio)) {
+ /* handle data + REQ_FUA */
+ if (bio_has_data(bio))
__split_and_process_bio(md, bio);
- dm_flush(md);
- }

- if (md->barrier_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->barrier_error);
+ if (md->flush_error != DM_ENDIO_REQUEUE)
+ bio_endio(bio, md->flush_error);
else {
spin_lock_irq(&md->deferred_lock);
bio_list_add_head(&md->deferred, bio);
@@ -2291,8 +2287,8 @@ static void dm_wq_work(struct work_struc
if (dm_request_based(md))
generic_make_request(c);
else {
- if (c->bi_rw & REQ_HARDBARRIER)
- process_barrier(md, c);
+ if (c->bi_rw & REQ_FLUSH)
+ process_flush(md, c);
else
__split_and_process_bio(md, c);
}
@@ -2317,8 +2313,8 @@ static void dm_rq_set_flush_nr(struct re
tio->info.flush_request = flush_nr;
}

-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
+/* Issue flush requests to targets and wait for their completion. */
+static int dm_rq_flush(struct mapped_device *md)
{
int i, j;
struct dm_table *map = dm_get_live_table(md);
@@ -2326,7 +2322,7 @@ static int dm_rq_barrier(struct mapped_d
struct dm_target *ti;
struct request *clone;

- md->barrier_error = 0;
+ md->flush_error = 0;

for (i = 0; i < num_targets; i++) {
ti = dm_table_get_target(map, i);
@@ -2341,26 +2337,26 @@ static int dm_rq_barrier(struct mapped_d
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
dm_table_put(map);

- return md->barrier_error;
+ return md->flush_error;
}

-static void dm_rq_barrier_work(struct work_struct *work)
+static void dm_rq_flush_work(struct work_struct *work)
{
int error;
struct mapped_device *md = container_of(work, struct mapped_device,
- barrier_work);
+ flush_work);
struct request_queue *q = md->queue;
struct request *rq;
unsigned long flags;

/*
* Hold the md reference here and leave it at the last part so that
- * the md can't be deleted by device opener when the barrier request
+ * the md can't be deleted by device opener when the flush request
* completes.
*/
dm_get(md);

- error = dm_rq_barrier(md);
+ error = dm_rq_flush(md);

rq = md->flush_request;
md->flush_request = NULL;
@@ -2520,7 +2516,7 @@ int dm_suspend(struct mapped_device *md,
up_write(&md->io_lock);

/*
- * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
+ * Request-based dm uses md->wq for flush (dm_rq_flush_work) which
* can be kicked until md->queue is stopped. So stop md->queue before
* flushing md->wq.
*/
Index: block/drivers/md/dm-log.c
===================================================================
--- block.orig/drivers/md/dm-log.c
+++ block/drivers/md/dm-log.c
@@ -300,7 +300,7 @@ static int flush_header(struct log_c *lc
.count = 0,
};

- lc->io_req.bi_rw = WRITE_BARRIER;
+ lc->io_req.bi_rw = WRITE_FLUSH;

return dm_io(&lc->io_req, 1, &null_location, NULL);
}
Index: block/drivers/md/dm-snap-persistent.c
===================================================================
--- block.orig/drivers/md/dm-snap-persistent.c
+++ block/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@ static void persistent_commit_exception(
/*
* Commit exceptions to disk.
*/
- if (ps->valid && area_io(ps, WRITE_BARRIER))
+ if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
ps->valid = 0;

/*

--
tejun

2010-08-14 10:37:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
> Do you want to change the whole thing in a single commit? That would
> be a pretty big invasive patch touching multiple subsystems.

We can just stop draining in the block layer in the first patch, then
stop doing the stuff in md/dm/etc in the following and then do the
final renaming patches. It would still be fewer patches than now, but
keep things working through the whole transition, which would really
help bisecting any problems.

> + if (req->cmd_flags & REQ_FUA)
> + vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;

I'd suggest not adding FUA support to virtio yet. Just using the flush
feature gives you a fully working barrier implementation.

Eventually we might want to add a flag in the block queue to send
REQ_FLUSH|REQ_FUA request through to virtio directly so that we can
avoid separate pre- and post flushes, but I really want to benchmark if
it makes an impact on real life setups first.

> Index: block/drivers/md/linear.c
> ===================================================================
> --- block.orig/drivers/md/linear.c
> +++ block/drivers/md/linear.c
> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
> dev_info_t *tmp_dev;
> sector_t start_sector;
>
> - if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> - md_barrier_request(mddev, bio);
> + if (unlikely(bio->bi_rw & REQ_FLUSH)) {
> + md_flush_request(mddev, bio);

We only need the special md_flush_request handling for
empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
flag propagated to the underlying devices.

> +static void md_end_flush(struct bio *bio, int err)
> {
> mdk_rdev_t *rdev = bio->bi_private;
> mddev_t *mddev = rdev->mddev;
>
> rdev_dec_pending(rdev, mddev);
>
> if (atomic_dec_and_test(&mddev->flush_pending)) {
> + /* The pre-request flush has finished */
> + schedule_work(&mddev->flush_work);

Once we only handle empty barriers here we can directly call bio_endio
and the super wakeup instead of first scheduling a work queue.

> while ((bio = bio_list_pop(writes))) {
> - if (unlikely(bio_empty_barrier(bio))) {
> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {

I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
useful macro for the bio based drivers.

> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
> */
> spin_lock_irqsave(&md->deferred_lock, flags);
> if (__noflush_suspending(md)) {
> - if (!(io->bio->bi_rw & REQ_HARDBARRIER))
> + if (!(io->bio->bi_rw & REQ_FLUSH))

I suspect we don't actually need to special case flushes here anymore.


> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
> io_error = io->error;
> bio = io->bio;
>
> - if (bio->bi_rw & REQ_HARDBARRIER) {
> + if (bio->bi_rw & REQ_FLUSH) {
> /*
> - * There can be just one barrier request so we use
> + * There can be just one flush request so we use
> * a per-device variable for error reporting.
> * Note that you can't touch the bio after end_io_acct
> */
> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
> - md->barrier_error = io_error;
> + if (!md->flush_error)
> + md->flush_error = io_error;

And we certainly do not need any special casing here. See my patch.

> {
> int rw = rq_data_dir(clone);
> int run_queue = 1;
> - bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
> + bool is_flush = clone->cmd_flags & REQ_FLUSH;
> struct dm_rq_target_io *tio = clone->end_io_data;
> struct mapped_device *md = tio->md;
> struct request *rq = tio->orig;
>
> - if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
> + if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {

We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
for the second half of this conditional.

> + if (!is_flush)
> + blk_end_request_all(rq, error);
> + else {
> if (unlikely(error))
> - store_barrier_error(md, error);
> + store_flush_error(md, error);
> run_queue = 0;
> - } else
> - blk_end_request_all(rq, error);
> + }

Flush requests can now be completed normally.

> @@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru
>
> ci.map = dm_get_live_table(md);
> if (unlikely(!ci.map)) {
> - if (!(bio->bi_rw & REQ_HARDBARRIER))
> + if (!(bio->bi_rw & REQ_FLUSH))
> bio_io_error(bio);
> else
> - if (!md->barrier_error)
> - md->barrier_error = -EIO;
> + if (!md->flush_error)
> + md->flush_error = -EIO;

No need for the special error handling here, flush requests can now
be completed normally.

> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
> part_stat_unlock();
>
> /*
> - * If we're suspended or the thread is processing barriers
> + * If we're suspended or the thread is processing flushes
> * we have to queue this io for later.
> */
> if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
> - unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> + (bio->bi_rw & REQ_FLUSH)) {
> up_read(&md->io_lock);

AFAICS this is only needed for the old barrier code, no need for this
for pure flushes.

> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
>
> static bool dm_rq_is_flush_request(struct request *rq)
> {
> - if (rq->cmd_flags & REQ_FLUSH)
> - return true;
> - else
> - return false;
> + return rq->cmd_flags & REQ_FLUSH;
> }

It's probably worth just killing this wrapper.


> void dm_dispatch_request(struct request *rq)
> @@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
> if (dm_rq_is_flush_request(rq)) {
> blk_rq_init(NULL, clone);
> clone->cmd_type = REQ_TYPE_FS;
> - clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
> + clone->cmd_flags |= (REQ_FLUSH | WRITE);
> } else {
> r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> dm_rq_bio_constructor, tio);

My suspicion is that we can get rid of all that special casing here
and just use blk_rq_prep_clone once it's been updated to propagate
REQ_FLUSH, similar to the DISCARD flag.

I also suspect that there is absolutely no need for the barrier work
queue once we stop waiting for outstanding requests. But then again
the request based dm code still somewhat confuses me.

> +static void process_flush(struct mapped_device *md, struct bio *bio)
> {
> + md->flush_error = 0;
> +
> + /* handle REQ_FLUSH */
> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>
> - bio_init(&md->barrier_bio);
> - md->barrier_bio.bi_bdev = md->bdev;
> - md->barrier_bio.bi_rw = WRITE_BARRIER;
> - __split_and_process_bio(md, &md->barrier_bio);
> + bio_init(&md->flush_bio);
> + md->flush_bio.bi_bdev = md->bdev;
> + md->flush_bio.bi_rw = WRITE_FLUSH;
> + __split_and_process_bio(md, &md->flush_bio);

There's no need to use a separate flush_bio here.
__split_and_process_bio does the right thing for empty REQ_FLUSH
requests. See my patch for how to do this differently. And yeah,
my version has been tested.

2010-08-16 16:36:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCH UPDATED 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers

Propagate deprecation of REQ_HARDBARRIER and new REQ_FLUSH/FUA
interface to upper layers.

* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
WRITE_FLUSH_FUA are added.

* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
copied from bio to request.

* BH_Ordered and BH_Eopnotsupp are marked deprecated. BH_Flush/FUA
are _NOT_ added as they can and should be specified when calling
submit_bh() as @rw parameter as suggested by Jan Kara.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
---
Dropped BH_Flush/FUA as suggested.

Thanks.

include/linux/blk_types.h | 2 +-
include/linux/buffer_head.h | 4 ++--
include/linux/fs.h | 20 +++++++++++++-------
3 files changed, 16 insertions(+), 10 deletions(-)

Index: block/include/linux/fs.h
===================================================================
--- block.orig/include/linux/fs.h
+++ block/include/linux/fs.h
@@ -138,13 +138,13 @@ struct inodes_stat_t {
* SWRITE_SYNC
* SWRITE_SYNC_PLUG Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
* See SWRITE.
- * WRITE_BARRIER Like WRITE_SYNC, but tells the block layer that all
- * previously submitted writes must be safely on storage
- * before this one is started. Also guarantees that when
- * this write is complete, it itself is also safely on
- * storage. Prevents reordering of writes on both sides
- * of this IO.
- *
+ * WRITE_BARRIER DEPRECATED. Always fails. Use FLUSH/FUA instead.
+ * WRITE_FLUSH Like WRITE_SYNC but with preceding cache flush.
+ * WRITE_FUA Like WRITE_SYNC but data is guaranteed to be on
+ * non-volatile media on completion.
+ * WRITE_FLUSH_FUA Combination of WRITE_FLUSH and FUA. The IO is preceded
+ * by a cache flush and data is guaranteed to be on
+ * non-volatile media on completion.
*/
#define RW_MASK REQ_WRITE
#define RWA_MASK REQ_RAHEAD
@@ -162,6 +162,12 @@ struct inodes_stat_t {
#define WRITE_META (WRITE | REQ_META)
#define WRITE_BARRIER (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
REQ_HARDBARRIER)
+#define WRITE_FLUSH (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FLUSH)
+#define WRITE_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FUA)
+#define WRITE_FLUSH_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FLUSH | REQ_FUA)
#define SWRITE_SYNC_PLUG (SWRITE | REQ_SYNC | REQ_NOIDLE)
#define SWRITE_SYNC (SWRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)

Index: block/include/linux/blk_types.h
===================================================================
--- block.orig/include/linux/blk_types.h
+++ block/include/linux/blk_types.h
@@ -164,7 +164,7 @@ enum rq_flag_bits {
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
#define REQ_COMMON_MASK \
(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
- REQ_META| REQ_DISCARD | REQ_NOIDLE)
+ REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)

#define REQ_UNPLUG (1 << __REQ_UNPLUG)
#define REQ_RAHEAD (1 << __REQ_RAHEAD)
Index: block/include/linux/buffer_head.h
===================================================================
--- block.orig/include/linux/buffer_head.h
+++ block/include/linux/buffer_head.h
@@ -32,8 +32,8 @@ enum bh_state_bits {
BH_Delay, /* Buffer is not yet allocated on disk */
BH_Boundary, /* Block is followed by a discontiguity */
BH_Write_EIO, /* I/O error on write */
- BH_Ordered, /* ordered write */
- BH_Eopnotsupp, /* operation not supported (barrier) */
+ BH_Ordered, /* DEPRECATED: ordered write */
+ BH_Eopnotsupp, /* DEPRECATED: operation not supported (barrier) */
BH_Unwritten, /* Buffer is allocated on disk but not written */
BH_Quiet, /* Buffer Error Prinks to be quiet */

2010-08-17 10:03:11

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello, Christoph.

On 08/14/2010 12:36 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
>> Do you want to change the whole thing in a single commit? That would
>> be a pretty big invasive patch touching multiple subsystems.
>
> We can just stop draining in the block layer in the first patch, then
> stop doing the stuff in md/dm/etc in the following and then do the
> final renaming patches. It would still be fewer patches than now, but
> keep things working through the whole transition, which would really
> help bisecting any problems.

I'm not really convinced that would help much. If bisecting can point
to the conversion as the culprit for whatever kind of failure,
wouldn't that be enough? No matter what we do the conversion will be
a single step thing. If we make the filesystems enforce the ordering
first and then relax ordering in the block layer, bisection would
still just point at the later patch. The same goes for md/dm, the
best we can find out would be whether the conversion is correct or not
anyway.

I'm not against restructuring the patchset if it makes more sense but
it just feels like it would be a bit pointless effort (and one which
would require much tighter coordination among different trees) at this
point. Am I missing something?

>> + if (req->cmd_flags & REQ_FUA)
>> + vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
>
> I'd suggest not adding FUA support to virtio yet. Just using the flush
> feature gives you a fully working barrier implementation.
>
> Eventually we might want to add a flag in the block queue to send
> REQ_FLUSH|REQ_FUA request through to virtio directly so that we can
> avoid separate pre- and post flushes, but I really want to benchmark if
> it makes an impact on real life setups first.

I wrote this in the other mail but I think it would make a difference if
the backend storage is md/dm, especially if it's shared by multiple VMs.
It cuts down on one array-wide cache flush.

>> Index: block/drivers/md/linear.c
>> ===================================================================
>> --- block.orig/drivers/md/linear.c
>> +++ block/drivers/md/linear.c
>> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
>> dev_info_t *tmp_dev;
>> sector_t start_sector;
>>
>> - if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> - md_barrier_request(mddev, bio);
>> + if (unlikely(bio->bi_rw & REQ_FLUSH)) {
>> + md_flush_request(mddev, bio);
>
> We only need the special md_flush_request handling for
> empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
> flag propagated to the underlying devices.

Hmm, not really, the WRITE should happen after all the data in cache
are committed to NV media, meaning that empty FLUSH should already
have finished by the time the WRITE starts.

>> +static void md_end_flush(struct bio *bio, int err)
>> {
>> mdk_rdev_t *rdev = bio->bi_private;
>> mddev_t *mddev = rdev->mddev;
>>
>> rdev_dec_pending(rdev, mddev);
>>
>> if (atomic_dec_and_test(&mddev->flush_pending)) {
>> + /* The pre-request flush has finished */
>> + schedule_work(&mddev->flush_work);
>
> Once we only handle empty barriers here we can directly call bio_endio
> and the super wakeup instead of first scheduling a work queue.

Yeap, right. That would be a nice optimization.

>> while ((bio = bio_list_pop(writes))) {
>> - if (unlikely(bio_empty_barrier(bio))) {
>> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
>
> I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> useful macro for the bio based drivers.

Hmm... maybe. The reason why I removed bio_empty_flush() was that
except for the front-most sequencer (block layer for all the request
based ones and the front-most make_request for bio based ones), it
doesn't make sense to see REQ_FLUSH + data bios. They should be
sequenced at the front-most stage anyway, so I didn't have much use
for them. Those code paths couldn't deal with REQ_FLUSH + data bios
anyway.

>> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
>> */
>> spin_lock_irqsave(&md->deferred_lock, flags);
>> if (__noflush_suspending(md)) {
>> - if (!(io->bio->bi_rw & REQ_HARDBARRIER))
>> + if (!(io->bio->bi_rw & REQ_FLUSH))
>
> I suspect we don't actually need to special case flushes here anymore.

Oh, I'm not sure about this part at all. I'll ask Mike.

>> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
>> io_error = io->error;
>> bio = io->bio;
>>
>> - if (bio->bi_rw & REQ_HARDBARRIER) {
>> + if (bio->bi_rw & REQ_FLUSH) {
>> /*
>> - * There can be just one barrier request so we use
>> + * There can be just one flush request so we use
>> * a per-device variable for error reporting.
>> * Note that you can't touch the bio after end_io_acct
>> */
>> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
>> - md->barrier_error = io_error;
>> + if (!md->flush_error)
>> + md->flush_error = io_error;
>
> And we certainly do not need any special casing here. See my patch.

I wasn't sure about that part. You removed store_flush_error(), but
DM_ENDIO_REQUEUE should still have higher priority than other
failures, no?

>> {
>> int rw = rq_data_dir(clone);
>> int run_queue = 1;
>> - bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
>> + bool is_flush = clone->cmd_flags & REQ_FLUSH;
>> struct dm_rq_target_io *tio = clone->end_io_data;
>> struct mapped_device *md = tio->md;
>> struct request *rq = tio->orig;
>>
>> - if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
>> + if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
>
> We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
> for the second half of this conditional.

I see.

>> + if (!is_flush)
>> + blk_end_request_all(rq, error);
>> + else {
>> if (unlikely(error))
>> - store_barrier_error(md, error);
>> + store_flush_error(md, error);
>> run_queue = 0;
>> - } else
>> - blk_end_request_all(rq, error);
>> + }
>
> Flush requests can now be completed normally.

The same question as before. I think we still need to prioritize
DM_ENDIO_REQUEUE failures.

>> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
>> part_stat_unlock();
>>
>> /*
>> - * If we're suspended or the thread is processing barriers
>> + * If we're suspended or the thread is processing flushes
>> * we have to queue this io for later.
>> */
>> if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
>> - unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> + (bio->bi_rw & REQ_FLUSH)) {
>> up_read(&md->io_lock);
>
> AFAICS this is only needed for the old barrier code, no need for this
> for pure flushes.

I'll ask Mike.

>> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
>>
>> static bool dm_rq_is_flush_request(struct request *rq)
>> {
>> - if (rq->cmd_flags & REQ_FLUSH)
>> - return true;
>> - else
>> - return false;
>> + return rq->cmd_flags & REQ_FLUSH;
>> }
>
> It's probably worth just killing this wrapper.

Yeah, probably. It was an accidental edit to begin with and I left
this part out in the new patch.

>> +static void process_flush(struct mapped_device *md, struct bio *bio)
>> {
>> + md->flush_error = 0;
>> +
>> + /* handle REQ_FLUSH */
>> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>>
>> - bio_init(&md->barrier_bio);
>> - md->barrier_bio.bi_bdev = md->bdev;
>> - md->barrier_bio.bi_rw = WRITE_BARRIER;
>> - __split_and_process_bio(md, &md->barrier_bio);
>> + bio_init(&md->flush_bio);
>> + md->flush_bio.bi_bdev = md->bdev;
>> + md->flush_bio.bi_rw = WRITE_FLUSH;
>> + __split_and_process_bio(md, &md->flush_bio);
>
> There's no need to use a separate flush_bio here.
> __split_and_process_bio does the right thing for empty REQ_FLUSH
> requests. See my patch for how to do this differently. And yeah,
> my version has been tested.

But how do you make sure REQ_FLUSHes for preflush finish before
starting the write?

Thanks.

--
tejun

2010-08-17 13:19:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
> I'm not really convinced that would help much. If bisecting can point
> to the conversion as the culprit for whatever kind of failure,
> wouldn't that be enough? No matter what we do the conversion will be
> a single step thing. If we make the filesystems enforce the ordering
> first and then relax ordering in the block layer, bisection would
> still just point at the later patch. The same goes for md/dm, the
> best we can find out would be whether the conversion is correct or not
> anyway.

The filesystems already enforce the ordering, except reiserfs which
opts out if the barrier option is set.

> I'm not against restructuring the patchset if it makes more sense but
> it just feels like it would be a bit pointless effort (and one which
> would require much tighter coordination among different trees) at this
> point. Am I missing something?

What other trees do you mean? The conversions of the 8 filesystems
that actually support barriers need to go through this tree anyway
if we want to be able to test it. Also the changes in the filesystem
are absolutely minimal - it's basically just
s/WRITE_BARRIER/WRITE_FLUSH_FUA/ after my initial patch killing BH_Ordered,
and removing about 10 lines of code in reiserfs.

> > We only need the special md_flush_request handling for
> > empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
> > flag propagated to the underlying devices.
>
> Hmm, not really, the WRITE should happen after all the data in cache
> are committed to NV media, meaning that empty FLUSH should already
> have finished by the time the WRITE starts.

You're right.

> >> while ((bio = bio_list_pop(writes))) {
> >> - if (unlikely(bio_empty_barrier(bio))) {
> >> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
> >
> > I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> > useful macro for the bio based drivers.
>
> Hmm... maybe. The reason why I removed bio_empty_flush() was that
> except for the front-most sequencer (block layer for all the request
> based ones and the front-most make_request for bio based ones), it
> doesn't make sense to see REQ_FLUSH + data bios. They should be
> sequenced at the front-most stage anyway, so I didn't have much use
> for them. Those code paths couldn't deal with REQ_FLUSH + data bios
> anyway.

The current bio_empty_barrier is only used in dm, and indeed only makes
sense for make_request-based drivers. But I think it's a rather useful
helper for them. Either way, it's not a big issue and either way is
fine with me.

> >> + if (bio->bi_rw & REQ_FLUSH) {
> >> /*
> >> - * There can be just one barrier request so we use
> >> + * There can be just one flush request so we use
> >> * a per-device variable for error reporting.
> >> * Note that you can't touch the bio after end_io_acct
> >> */
> >> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
> >> - md->barrier_error = io_error;
> >> + if (!md->flush_error)
> >> + md->flush_error = io_error;
> >
> > And we certainly do not need any special casing here. See my patch.
>
> I wasn't sure about that part. You removed store_flush_error(), but
> DM_ENDIO_REQUEUE should still have higher priority than other
> failures, no?

Which priority?

> >> +static void process_flush(struct mapped_device *md, struct bio *bio)
> >> {
> >> + md->flush_error = 0;
> >> +
> >> + /* handle REQ_FLUSH */
> >> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
> >>
> >> - bio_init(&md->barrier_bio);
> >> - md->barrier_bio.bi_bdev = md->bdev;
> >> - md->barrier_bio.bi_rw = WRITE_BARRIER;
> >> - __split_and_process_bio(md, &md->barrier_bio);
> >> + bio_init(&md->flush_bio);
> >> + md->flush_bio.bi_bdev = md->bdev;
> >> + md->flush_bio.bi_rw = WRITE_FLUSH;
> >> + __split_and_process_bio(md, &md->flush_bio);
> >
> > There's no need to use a separate flush_bio here.
> > __split_and_process_bio does the right thing for empty REQ_FLUSH
> > requests. See my patch for how to do this differently. And yeah,
> > my version has been tested.
>
> But how do you make sure REQ_FLUSHes for preflush finish before
> starting the write?

Hmm, okay. I see how the special flush_bio makes the waiting easier,
let's see if Mike or others in the DM team have a better idea.

2010-08-17 13:27:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 08/11] block: rename barrier/ordered to flush

> -#define blk_queue_flushing(q) ((q)->ordseq)
> +#define blk_queue_flushing(q) ((q)->flush_seq)

Btw, I think this one should just go away. It's only used by
ide in an attempt to make ordered sequences atomic, which isn't
needed for the new design.

2010-08-17 16:27:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 08/11] block: rename barrier/ordered to flush

Hello,

On 08/17/2010 03:26 PM, Christoph Hellwig wrote:
>> -#define blk_queue_flushing(q) ((q)->ordseq)
>> +#define blk_queue_flushing(q) ((q)->flush_seq)
>
> Btw, I think this one should just go away. It's only used by
> ide in an attempt to make ordered sequences atomic, which isn't
> needed for the new design.

Yeap, agreed. I couldn't really understand why the sequence
needed to be atomic for ide in the first place so just left it alone.
Do you understand why it tried to be atomic?

Thanks.

--
tejun

2010-08-17 16:45:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hi,

On 08/17/2010 03:19 PM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
>> I'm not against restructuring the patchset if it makes more sense but
>> it just feels like it would be a bit pointless effort (and one which
>> would require much tighter coordination among different trees) at this
>> point. Am I missing something?
>
> What other trees do you mean?

I was mostly thinking about dm/md, drdb and stuff, but you're talking
about filesystem conversion patches being routed through block tree,
right?

> The conversions of the 8 filesystems that actually support barriers
> need to go through this tree anyway if we want to be able to test
> it. Also the changes in the filesystem are absolutely minimal -
> it's basically just s/WRITE_BARRIER/WRITE_FLUSH_FUA/ after my
> initial patch killing BH_Ordered, and removing about 10 lines of code in
> reiserfs.

I might just resequence it to finish this part of discussion but what
does that really buy us? It's not really gonna help bisection.
Bisection won't be able to tell anything in higher resolution than
"the new implementation doesn't work". If you show me how it would
actually help, I'll happily reshuffle the patches.

>> I wasn't sure about that part. You removed store_flush_error(), but
>> DM_ENDIO_REQUEUE should still have higher priority than other
>> failures, no?
>
> Which priority?

IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
core layer to retry the whole bio later), it trumps all other failures
and the bio is retried later. That was why DM_ENDIO_REQUEUE was
prioritized over other error codes, which actually is sort of
incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
layers as FLUSH failure implies data already lost. So,
DM_ENDIO_REQUEUE actually should have lower priority than other
failures. But, then again, the error codes still need to be
prioritized.
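
Something like the following completely untested sketch is what I have
in mind (it just illustrates the priority; the field and lock names are
the ones used in my dm patch): any hard error overrides
DM_ENDIO_REQUEUE and the first hard error seen is kept.

/* sketch only: hard I/O errors take precedence over DM_ENDIO_REQUEUE */
static void store_flush_error(struct mapped_device *md, int error)
{
	unsigned long flags;

	spin_lock_irqsave(&md->flush_error_lock, flags);
	if (!md->flush_error ||
	    (md->flush_error == DM_ENDIO_REQUEUE && error != DM_ENDIO_REQUEUE))
		md->flush_error = error;
	spin_unlock_irqrestore(&md->flush_error_lock, flags);
}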

>> But how do you make sure REQ_FLUSHes for preflush finish before
>> starting the write?
>
> Hmm, okay. I see how the special flush_bio makes the waiting easier,
> let's see if Mike or others in the DM team have a better idea.

Yeah, it would be better if it can be sequenced w/o using a work but
let's leave it for later.

Thanks.

--
tejun

2010-08-17 17:00:03

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
> > What other trees do you mean?
>
> I was mostly thinking about dm/md, drdb and stuff, but you're talking
> about filesystem conversion patches being routed through block tree,
> right?

I think we really need all the conversions in one tree, block layer,
remapping drivers and filesystems.

Btw, I've done the conversion for all filesystems and I'm running tests
over them now. Expect the series late today or tomorrow.

> I might just resequence it to finish this part of discussion but what
> does that really buy us? It's not really gonna help bisection.
> Bisection won't be able to tell anything in higher resolution than
> "the new implementation doesn't work". If you show me how it would
> actually help, I'll happily reshuffle the patches.

It's not about bisecting to find bugs in the barrier conversion. We can't
easily bisect it down anyway. The problem is when we try to bisect
other problems and get into the middle of the series barriers suddenly
are gone. Which is not very helpful for things like data integrity
problems in filesystems.

> >> I wasn't sure about that part. You removed store_flush_error(), but
> >> DM_ENDIO_REQUEUE should still have higher priority than other
> >> failures, no?
> >
> > Which priority?
>
> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later. That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost. So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures. But, then again, the error codes still need to be
> prioritized.

I think that's something we better leave to the DM team.

2010-08-17 17:10:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 08/11] block: rename barrier/ordered to flush

On Tue, Aug 17, 2010 at 06:23:55PM +0200, Tejun Heo wrote:
> Yeap, agreed. I couldn't really understand why the sequence
> needed to be atomic for ide in the first place so just left it alone.
> Do you understand why it tried to be atomic?

I think initial drafts of the barrier specification talked about atomic
sequences. Except for that I can't think of any reason.

2010-08-18 06:23:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 08/11] block: rename barrier/ordered to flush

Hello,

On 08/17/2010 07:08 PM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 06:23:55PM +0200, Tejun Heo wrote:
>> Yeap, agreed. I couldn't really understand why the sequence
>> needed to be atomic for ide in the first place so just left it alone.
>> Do you understand why it tried to be atomic?
>
> I think initial drafts of the barrier specification talked about atomic
> sequences. Except for that I can't think of any reason.

Hmm... alright, I'll rip it out.

Thanks.

--
tejun

2010-08-18 06:38:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/17/2010 06:59 PM, Christoph Hellwig wrote:
> I think we really need all the conversions in one tree, block layer,
> remapping drivers and filesystems.

I don't know. If filesystem changes are really trivial maybe, but
md/dm changes seem a bit too invasive to go through the block tree.

> Btw, I've done the conversion for all filesystems and I'm running tests
> over them now. Expect the series late today or tomorrow.

Cool. :-)

>> I might just resequence it to finish this part of discussion but what
>> does that really buy us? It's not really gonna help bisection.
>> Bisection won't be able to tell anything in higher resolution than
>> "the new implementation doesn't work". If you show me how it would
>> actually help, I'll happily reshuffle the patches.
>
> It's not about bisecting to find bugs in the barrier conversion. We can't
> easily bisect it down anyway. The problem is when we try to bisect
> other problems and get into the middle of the series barriers suddenly
> are gone. Which is not very helpful for things like data integrity
> problems in filesystems.

Ah, okay, hmmm.... alright, I'll resequence the patches. If the
filesystem changes can be put into a single tree somehow, we can keep
things mostly working at least for direct devices.

>> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
>> core layer to retry the whole bio later), it trumps all other failures
>> and the bio is retried later. That was why DM_ENDIO_REQUEUE was
>> prioritized over other error codes, which actually is sort of
>> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
>> layers as FLUSH failure implies data already lost. So,
>> DM_ENDIO_REQUEUE actually should have lower priority than other
>> failures. But, then again, the error codes still need to be
>> prioritized.
>
> I think that's something we better leave to the DM team.

Sure, but we shouldn't be ripping out the code to do that.

Thanks.

--
tejun

2010-08-18 08:11:01

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/18/2010 08:35 AM, Tejun Heo wrote:
>> It's not about bisecting to find bugs in the barrier conversion. We can't
>> easily bisect it down anyway. The problem is when we try to bisect
>> other problems and get into the middle of the series barriers suddenly
>> are gone. Which is not very helpful for things like data integrity
>> problems in filesystems.
>
> Ah, okay, hmmm.... alright, I'll resequence the patches. If the
> filesystem changes can be put into a single tree somehow, we can keep
> things mostly working at least for direct devices.

Sorry, but I don't think I'm doing it. It just doesn't make much sense. I can't
relax the ordering for REQ_HARDBARRIER without breaking the remapping
drivers. So, to keep things working, I'll have to 1. relax the
ordering 2. implement new REQ_FLUSH/FUA based interface and 3. use
them in the filesystems in the same patch. That's just wrong. And I
don't think md/dm changes can or should go through the block tree.
They're way too invasive for that. It's a new implementation and
barrier won't work (fail gracefully) for several commits during the
transition. I don't think there's a better way around it.

Thanks.

--
tejun

2010-08-18 09:47:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

FYI: One issue with this series is that make_request based drivers
now have to accept all REQ_FLUSH and REQ_FUA requests. We'll either
need to add handling for empty REQ_FLUSH requests to all of them or
figure out a way to prevent them getting sent. That is assuming they'll
simply ignore REQ_FLUSH/REQ_FUA on normal writes.

2010-08-18 19:29:05

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Christoph Hellwig, on 08/13/2010 05:17 PM wrote:
> As far as playing with ordered tags goes, it's just adding a new flag for
> it on the bio that gets passed down to the driver. For a final version
> you'd need a queue-level feature if it's supported, but you don't
> even need that for the initial work. Then you can implement a
> variant of blk_do_flush that does away with queueing additional requests
> once one finishes but queues all two or three at the same time with your
> new ordered flag set, at which point you are back to the level of
> ordered tag usage that the old code allows. You're still left with
> all the hard problems of actually implementing error handling for it
> and using it higher up in the filesystem and generic page cache code.

But how about file systems doing internal local order-by-drain? Without
converting them to use ordered commands it would be impossible to show
their full potential, and to make the conversion one would need deep
internal FS knowledge. That's my point. But if there's a trivial way to
see all such places in the filesystem code and convert them, then OK, I agree.

> I'd really love to see your results, up to the point of just trying
> that once I get a little spare time. But my theory is that it won't
> help us - the problem with ordered tags is that they enforce global
> ordering while we currently have local ordering. While it will reduce
> the latency for the process waiting for an fsync or similar it will
> affect other I/O going on in the background and reduce the device's
> ability to reorder that I/O.

The local ordering vs global ordering distinction is relevant only if you
have a load from several applications/threads. But how about a single
application/thread?

Another point, for which, AFAIU, the ORDERED commands were invented, is
that they enforce ordering on the _other_ side of the link, _after_ all
link/transfer latencies. This is why it's hard to see the advantage of them
on local disks.

Vlad

2010-08-18 19:30:41

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

Tejun Heo, on 08/13/2010 05:21 PM wrote:
>> If requested, I can develop the interface further.
>
> I still think the benefit of ordering by tag would be marginal at
> best, and what have you guys measured there? Under the current
> framework, there's no easy way to measure full ordered-by-tag
> implementation. The mechanism for filesystems to communicate the
> ordering information (which would be a partially ordered graph) just
> isn't there and there is no way the current usage of ordering-by-tag
> only for barrier sequence can achieve anything close to that level of
> difference.

Basically, I measured how iSCSI link utilization depends on the amount of
queued commands and the queued data size. This is why I made it a table.
From it you can see what improvement you would get by removing queue
draining after 1, 2, 4, etc. commands, depending on the command sizes.

For instance, in my previous XFS rm example, where rm of 4 files took
3.5 minutes with the nobarrier option, I could see that XFS was sending 1-3
32K commands in a row. From my table you can see that if it sent all of
them at once without draining, it would see about a 150-200% speed increase.

Vlad

2010-08-19 09:55:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/18/2010 09:30 PM, Vladislav Bolkhovitin wrote:
> Basically, I measured how iSCSI link utilization depends on the amount
> of queued commands and the queued data size. This is why I made it a
> table. From it you can see what improvement you would get by removing
> queue draining after 1, 2, 4, etc. commands, depending on the command
> sizes.
>
> For instance, in my previous XFS rm example, where rm of 4 files
> took 3.5 minutes with the nobarrier option, I could see that XFS was
> sending 1-3 32K commands in a row. From my table you can see that if
> it sent all of them at once without draining, it would see about a
> 150-200% speed increase.

You compared barrier off/on. Of course, it will make a big
difference. I think a good part of that gain should be realized by the
currently proposed patchset which removes draining. What needs to
be demonstrated is the difference between ordered-by-waiting and
ordered-by-tag. We've never had code to do that properly.

The original ordered-by-tag we had only applied tag ordering to two or
three command sequences inside a barrier, which doesn't amount to much
(and could even be harmful as it imposes draining of all simple
commands inside the device only to reduce issue latencies for a few
commands). You'll need to hook into the filesystem and somehow export the
ordering information down to the driver so that whatever needs
ordering is sent out as ordered commands.

As I've written multiple times, I'm pretty skeptical it will bring much.
Ordered tag mandates draining inside the device just like the original
barrier implementation. Sure, it's done at a lower layer and command
issue latencies will be reduced thanks to that but ordered-by-waiting
doesn't require _any_ draining at all. The whole pipeline can be kept
full all the time. I'm often wrong tho, so please feel free to go
ahead and prove me wrong. :-)

Thanks.

--
tejun

2010-08-19 10:01:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/18/2010 11:46 AM, Christoph Hellwig wrote:
> FYI: One issue with this series is that make_request based drivers
> now have to accept all REQ_FLUSH and REQ_FUA requests. We'll either
> need to add handling for empty REQ_FLUSH requests to all of them or
> figure out a way to prevent them getting sent. That is assuming they'll
> simply ignore REQ_FLUSH/REQ_FUA on normal writes.

Can you be a bit more specific? In most cases, request based drivers
should be fine. They sit behind the front-most request_queue which
would decompose REQ_FLUSH/FUAs into the appropriate command sequence.
For the request based drivers, it's no different from the original
REQ_HARDBARRIER mechanism; they'll just see flushes and optionally FUA
writes.

Thanks.

--
tejun

2010-08-19 10:20:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Thu, Aug 19, 2010 at 11:57:53AM +0200, Tejun Heo wrote:
> On 08/18/2010 11:46 AM, Christoph Hellwig wrote:
> > FYI: One issue with this series is that make_request based drivers
> > now have to accept all REQ_FLUSH and REQ_FUA requests. We'll either
> > need to add handling for empty REQ_FLUSH requests to all of them or
> > figure out a way to prevent them getting sent. That is assuming they'll
> > simply ignore REQ_FLUSH/REQ_FUA on normal writes.
>
> Can you be a bit more specific? In most cases, request based drivers
> should be fine. They sit behind the front most request_queue which
> would discompose REQ_FLUSH/FUAs into appropriate command sequence.

I said make_request based drivers, that is drivers taking bios. These
get bios directly from __generic_make_request and need to deal with
REQ_FLUSH/FUA themselves. We have quite a few more than just dm/md of
this kind:

arch/powerpc/sysdev/axonram.c: blk_queue_make_request(bank->disk->queue, axon_ram_make_request);
drivers/block/aoe/aoeblk.c: blk_queue_make_request(d->blkq, aoeblk_make_request);
drivers/block/brd.c: blk_queue_make_request(brd->brd_queue, brd_make_request);
drivers/block/drbd/drbd_main.c: blk_queue_make_request(q, drbd_make_request_26);
drivers/block/loop.c: blk_queue_make_request(lo->lo_queue, loop_make_request);
drivers/block/pktcdvd.c: blk_queue_make_request(q, pkt_make_request);
drivers/block/ps3vram.c: blk_queue_make_request(queue, ps3vram_make_request);
drivers/block/umem.c: blk_queue_make_request(card->queue, mm_make_request);
drivers/s390/block/dcssblk.c: blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request);
drivers/s390/block/xpram.c: blk_queue_make_request(xpram_queues[i], xpram_make_request);
drivers/staging/zram/zram_drv.c:blk_queue_make_request(zram->queue, zram_make_request);

2010-08-19 10:26:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/19/2010 12:20 PM, Christoph Hellwig wrote:
> I said make_request based drivers, that is drivers taking bios.

Right. Gees, it's confusing.

> These get bios directly from __generic_make_request and need to deal
> with REQ_FLUSH/FUA themselves. We have quite a few more than just
> dm/md of this kind:
>
> arch/powerpc/sysdev/axonram.c
> drivers/block/aoe/aoeblk.c
> drivers/block/brd.c

I'll try to convert these three.

> drivers/block/drbd/drbd_main.c

I'd rather leave drbd to its maintainers.

> drivers/block/loop.c

Already converted.

> drivers/block/pktcdvd.c
> drivers/block/ps3vram.c
> drivers/block/umem.c
> drivers/s390/block/dcssblk.c
> drivers/s390/block/xpram.c
> drivers/staging/zram/zram_drv.c

Will work on these.

Thanks.

--
tejun

2010-08-20 08:27:55

by Kiyoshi Ueda

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hi Tejun, Christoph,

On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
>>> I wasn't sure about that part. You removed store_flush_error(), but
>>> DM_ENDIO_REQUEUE should still have higher priority than other
>>> failures, no?
>>
>> Which priority?
>
> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later. That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost. So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures. But, then again, the error codes still need to be
> prioritized.

I think that's correct and changing the priority of DM_ENDIO_REQUEUE
for REQ_FLUSH down to the lowest should be fine.
(I didn't know that a FLUSH failure implies the possibility of data loss.)

But the patch is not enough; you have to change target drivers, too.
E.g. for multipath, you need to change
drivers/md/dm-mpath.c:do_end_io() to return the error for REQ_FLUSH,
like the REQ_DISCARD support included in 2.6.36-rc1.
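
For example, something like the following untested sketch in
do_end_io() (assuming the request parameter is the clone passed in;
the exact form should follow what the REQ_DISCARD support does):

	/* sketch: don't requeue failed flushes, report the error upward */
	if (clone->cmd_flags & REQ_FLUSH)
		return error;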


By the way, if this patch-set with the change above is included,
even a single path failure for REQ_FLUSH on a multipath configuration will
be reported to the upper layer as an error, although it's currently retried
using other paths.
Then, if an upper layer doesn't take the correct recovery action for the
error, it would be seen as a regression by users (e.g. frequent EXT3 errors
resulting in a read-only mount on a multipath configuration).

Although I think an explicit error is better than implicit data
corruption, please check the upper layers carefully so that users see
such errors as rarely as possible.

Thanks,
Kiyoshi Ueda

2010-08-20 13:22:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

FYI: here's a little writeup to document the new cache flushing scheme,
intended to replace Documentation/block/barriers.txt. Any good
suggestion for a filename in the kernel tree?

---

Explicit volatile write cache control
=====================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the physical medium. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the physical medium when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.


Explicit cache flushes
----------------------

The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that write requests which completed before the bio was submitted
are actually on the physical medium before this request is started.
In addition the REQ_FLUSH flag can be set on an otherwise empty bio
structure, which causes only an explicit cache flush without any dependent
I/O. It is recommended to use the blkdev_issue_flush() helper for a pure
cache flush.
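
As a rough sketch (error handling omitted; my_flush_end_io and
my_private_data are just placeholders for the caller's completion
handler and context), issuing a pure cache flush by hand looks like:

	struct bio *bio = bio_alloc(GFP_NOIO, 0);

	bio->bi_bdev = bdev;
	bio->bi_end_io = my_flush_end_io;
	bio->bi_private = my_private_data;
	submit_bio(WRITE_FLUSH, bio);	/* empty bio, only REQ_FLUSH set */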


Forced Unit Access
------------------

The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is not
signaled before the data has made it to non-volatile storage on the
physical medium.


Implementation details for filesystems
--------------------------------------

Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry about whether the underlying devices need any explicit cache flushing
or how Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.
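
As a rough example, a filesystem that used to submit its commit block
with WRITE_BARRIER could now do something like the following sketch
(error handling omitted; it mirrors the usual sync_dirty_buffer pattern):

	lock_buffer(bh);
	get_bh(bh);
	bh->b_end_io = end_buffer_write_sync;
	submit_bh(WRITE_FLUSH_FUA, bh);	/* preflush + FUA commit write */
	wait_on_buffer(bh);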


Implementation details for make_request_fn based block drivers
--------------------------------------------------------------

These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bit needs to be propagated to the underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.
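
For a make_request_fn based driver without a volatile cache this boils
down to something like the following sketch (my_make_request and
my_do_io are placeholders for the driver's own functions):

static int my_make_request(struct request_queue *q, struct bio *bio)
{
	if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
		/* nothing is cached, so an empty flush is a no-op */
		bio_endio(bio, 0);
		return 0;
	}
	/* REQ_FLUSH/REQ_FUA on bios with data can simply be ignored */
	return my_do_io(q, bio);
}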


Implementation details for request_fn based block drivers
--------------------------------------------------------------

For devices that do not support volatile write caches there is no driver
support required, the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For device with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);

and handle empty REQ_FLUSH requests in it's prep_fn/request_fn. Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of empty REQ_FLUSH and the actual write by the block layer. For devices
that also support the FUA bit the block layer needs to be told to pass
through that bit using:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);

and handle write requests that have the REQ_FUA bit set properly in it's
prep_fn/request_fn. If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH requests after the actual write.

2010-08-20 15:18:30

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
> FYI: here's a little writeup to document the new cache flushing scheme,
> intended to replace Documentation/block/barriers.txt. Any good
> suggestion for a filename in the kernel tree?
>
> ---

I was thinking that we might be better off using the "durable writes" term more
since it is well documented (at least in the database world, where it is the "D"
in ACID properties). Maybe "durable_writes_support.txt" ?


>
> Explicit volatile write cache control
> =====================================
>
> Introduction
> ------------
>
> Many storage devices, especially in the consumer market, come with volatile
> write back caches. That means the devices signal I/O completion to the
> operating system before data actually has hit the physical medium. This
> behavior obviously speeds up various workloads, but it means the operating
> system needs to force data out to the physical medium when it performs
> a data integrity operation like fsync, sync or an unmount.
>
> The Linux block layer provides a two simple mechanism that lets filesystems
> control the caching behavior of the storage device. These mechanisms are
> a forced cache flush, and the Force Unit Access (FUA) flag for requests.
>

Should we mention that users can also disable the write cache on the target device?

It might also be worth mentioning that storage needs to be properly configured -
i.e., an internal hardware RAID card with battery backing can expose
itself as a writethrough cache *only if* it actually has control over all of the
backend disks and can flush/disable their write caches.

Maybe that is too much detail, but I know that people have lost data with some
of these setups.

The rest of the write up below sounds good, thanks for pulling this together!

Ric


>
> Explicit cache flushes
> ----------------------
>
> The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from the
> filesystem and will make sure the volatile cache of the storage device
> has been flushed before the actual I/O operation is started. The explicit
> guarantees write requests that have completed before the bio was submitted
> actually are on the physical medium before this request has started.
> In addition the REQ_FLUSH flag can be set on an otherwise empty bio
> structure, which causes only an explicit cache flush without any dependent
> I/O. It is recommend to use the blkdev_issue_flush() helper for a pure
> cache flush.
>
>
> Forced Unit Access
> -----------------
>
> The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
> filesystem and will make sure that I/O completion for this requests is not
> signaled before the data has made it to non-volatile storage on the
> physical medium.
>
>
> Implementation details for filesystems
> --------------------------------------
>
> Filesystem can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
> worry if the underlying devices need any explicit cache flushing and how
> the Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
> may both be set on a single bio.
>
>
> Implementation details for make_request_fn based block drivers
> --------------------------------------------------------------
>
> These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
> directly below the submit_bio interface. For remapping drivers the REQ_FUA
> bits needs to be propagate to underlying devices, and a global flush needs
> to be implemented for bios with the REQ_FLUSH bit set. For real device
> drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
> on non-empty bios can simply be ignored, and REQ_FLUSH requests without
> data can be completed successfully without doing any work. Drivers for
> devices with volatile caches need to implement the support for these
> flags themselves without any help from the block layer.
>
>
> Implementation details for request_fn based block drivers
> --------------------------------------------------------------
>
> For devices that do not support volatile write caches there is no driver
> support required, the block layer completes empty REQ_FLUSH requests before
> entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
> requests that have a payload. For device with volatile write caches the
> driver needs to tell the block layer that it supports flushing caches by
> doing:
>
> blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
>
> and handle empty REQ_FLUSH requests in it's prep_fn/request_fn. Note that
> REQ_FLUSH requests with a payload are automatically turned into a sequence
> of empty REQ_FLUSH and the actual write by the block layer. For devices
> that also support the FUA bit the block layer needs to be told to pass
> through that bit using:
>
> blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
>
> and handle write requests that have the REQ_FUA bit set properly in it's
> prep_fn/request_fn. If the FUA bit is not natively supported the block
> layer turns it into an empty REQ_FLUSH requests after the actual write.

2010-08-20 16:01:40

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 20, 2010 at 11:18:07AM -0400, Ric Wheeler wrote:
> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
> >FYI: here's a little writeup to document the new cache flushing scheme,
> >intended to replace Documentation/block/barriers.txt. Any good
> >suggestion for a filename in the kernel tree?
> >
> >---
>
> I was thinking that we might be better off using the "durable
> writes" term more since it is well documented (at least in the
> database world, where it is the "D" in ACID properties). Maybe
> "durable_writes_support.txt" ?

sata_lies.txt?

Ok, maybe writeback_cache.txt?

-chris

2010-08-20 16:02:55

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 08/20/2010 12:00 PM, Chris Mason wrote:
> On Fri, Aug 20, 2010 at 11:18:07AM -0400, Ric Wheeler wrote:
>> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
>>> FYI: here's a little writeup to document the new cache flushing scheme,
>>> intended to replace Documentation/block/barriers.txt. Any good
>>> suggestion for a filename in the kernel tree?
>>>
>>> ---
>>
>> I was thinking that we might be better off using the "durable
>> writes" term more since it is well documented (at least in the
>> database world, where it is the "D" in ACID properties). Maybe
>> "durable_writes_support.txt" ?
>
> sata_lies.txt?
>
> Ok, maybe writeback_cache.txt?
>
> -chris

writeback_cache.txt is certainly the least confusing :)

ric

2010-08-23 12:19:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
> I think that's correct and changing the priority of DM_ENDIO_REQUEUE
> for REQ_FLUSH down to the lowest should be fine.
> (I didn't know that FLUSH failure implies data loss possibility.)

At least on ATA, FLUSH failure implies that data is already lost, so
the error can't be ignored or retried.

> But the patch is not enough, you have to change target drivers, too.
> E.g. As for multipath, you need to change
> drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
> like the REQ_DISCARD support included in 2.6.36-rc1.

I'll take a look, but is there an easy way to test mpath other than having
fancy hardware?

> By the way, if these patch-set with the change above are included,
> even one path failure for REQ_FLUSH on multipath configuration will
> be reported to upper layer as error, although it's retried using
> other paths currently.
> Then, if an upper layer won't take correct recovery action for the error,
> it would be seen as a regression for users. (e.g. Frequent EXT3-error
> resulting in read-only mount on multipath configuration.)
>
> Although I think the explicit error is fine rather than implicit data
> corruption, please check upper layers carefully so that users won't see
> such errors as much as possible.

Argh... then it will have to discern why FLUSH failed. It can retry
for transport errors but if it got aborted by the device it should
report upwards. Maybe just turn off barrier support in mpath for now?

Thanks.

--
tejun

2010-08-23 12:35:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/20/2010 05:18 PM, Ric Wheeler wrote:
> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
>> FYI: here's a little writeup to document the new cache flushing scheme,
>> intended to replace Documentation/block/barriers.txt. Any good
>> suggestion for a filename in the kernel tree?
>>
>
> I was thinking that we might be better off using the "durable
> writes" term more since it is well documented (at least in the
> database world, where it is the "D" in ACID properties). Maybe
> "durable_writes_support.txt" ?

The term is very foreign to people outside of the enterprise / database
loop. writeback-cache.txt or write-cache-control.txt sounds good
enough to me.

>> The Linux block layer provides a two simple mechanism that lets filesystems
>> control the caching behavior of the storage device. These mechanisms are
>> a forced cache flush, and the Force Unit Access (FUA) flag for requests.
>>
>
> Should we mention that users can also disable the write cache on the
> target device?
>
> It might also be worth mentioning that storage needs to be properly
> configured - i.e., an internal hardware RAID card with battery
> backing needs can expose itself as a writethrough cache *only if* it
> actually has control over all of the backend disks and can
> flush/disable their write caches.

It might be useful to give several example configurations with
different cache configurations. I don't have much experience with
battery backed arrays but aren't they supposed to report write through
cache automatically?

Thanks.

--
tejun

2010-08-23 12:41:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/20/2010 03:22 PM, Christoph Hellwig wrote:
> Many storage devices, especially in the consumer market, come with volatile
> write back caches. That means the devices signal I/O completion to the
> operating system before data actually has hit the physical medium.

A bit nit picky but flash devices can also have writeback caches and
the term physical medium sounds a bit off for those cases. Maybe just
saying "non-volatile media" is better?

> Implementation details for filesystems
> --------------------------------------
>
> Filesystem can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
> worry if the underlying devices need any explicit cache flushing and how
> the Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
> may both be set on a single bio.

It may be worthwhile to explain the sequence of operations when
REQ_FLUSH + data + REQ_FUA is executed. It can be extrapolated from
the previous two descriptions but I think giving examples of different
sequences depending on FLUSH/FUA configuration would be helpful for
understanding the overall picture of things.

Other than those, looks good to me.

Thanks.

--
tejun

2010-08-23 12:48:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
> It might be useful to give several example configurations with
> different cache configurations. I don't have much experience with
> battery backed arrays but aren't they suppose to report write through
> cache automatically?

They usually do. I have one that doesn't, but SYNCHRONIZE CACHE on
it is so fast that it effectively must be a no-op.

2010-08-23 13:59:01

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 08/23/2010 08:48 AM, Christoph Hellwig wrote:
> On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
>> It might be useful to give several example configurations with
>> different cache configurations. I don't have much experience with
>> battery backed arrays but aren't they suppose to report write through
>> cache automatically?
>
> They usually do. I have one that doesn't, but SYNCHRONIZE CACHE on
> it is so fast that it effectively must be a no-op.
>

Arrays are not a problem in general - they normally have internal, redundant
batteries to hold up the cache.

The issue is when you have an internal hardware RAID card with a large cache.
Those cards sit in your server and the batteries on the card protect its
internal cache, but do not have the capacity to hold up the drives behind it.

Normally, those drives should have their write cache disabled, but sometimes
(especially with S-ATA disks) this is not done.

ric

2010-08-23 14:01:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 2010-08-23 15:58, Ric Wheeler wrote:
> On 08/23/2010 08:48 AM, Christoph Hellwig wrote:
>> On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
>>> It might be useful to give several example configurations with
>>> different cache configurations. I don't have much experience with
>>> battery backed arrays but aren't they suppose to report write through
>>> cache automatically?
>>
>> They usually do. I have one that doesn't, but SYNCHRONIZE CACHE on
>> it is so fast that it effectively must be a no-op.
>>
>
> Arrays are not a problem in general - they normally have internally, redundant
> batteries to hold up the cache.
>
> The issue is when you have an internal hardware RAID card with a large cache.
> Those cards sit in your server and the batteries on the card protect its
> internal cache, but do not have the capacity to hold up the drives behind it.
>
> Normally, those drives should have their write cache disabled, but sometimes
> (especially with S-ATA disks) this is not done.

The problem purely exists on arrays that report write back cache enabled
AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
they purely urban legend?

--
Jens Axboe

2010-08-23 14:06:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Below is an updated version of the documentation. It fixes several
typos Zach Brown noticed and replaces all references to a physical
medium with the term non-volatile storage. I haven't added any examples
yet as I need to figure out how they fit into the rest of the document.

---

Explicit volatile write cache control
=====================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.


Explicit cache flushes
----------------------

The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O. It is recommended to use
the blkdev_issue_flush() helper for a pure cache flush.
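
As an illustration only (this is roughly what the blkdev_issue_flush() helper
does internally; the example_* names below are made up for this sketch and
the error handling is kept minimal), an empty REQ_FLUSH bio can be issued
and waited for like this:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

struct example_flush_done {
        struct completion event;
        int error;
};

static void example_flush_end_io(struct bio *bio, int error)
{
        struct example_flush_done *done = bio->bi_private;

        done->error = error;
        complete(&done->event);
}

static int example_issue_flush(struct block_device *bdev)
{
        struct example_flush_done done;
        struct bio *bio;

        init_completion(&done.event);
        done.error = 0;

        bio = bio_alloc(GFP_KERNEL, 0);         /* no data pages: pure cache flush */
        if (!bio)
                return -ENOMEM;
        bio->bi_bdev = bdev;
        bio->bi_end_io = example_flush_end_io;
        bio->bi_private = &done;

        /* REQ_FLUSH OR ed into the r/w flags of an otherwise empty bio */
        submit_bio(WRITE | REQ_FLUSH, bio);

        wait_for_completion(&done.event);
        bio_put(bio);
        return done.error;
}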


Forced Unit Access
------------------

The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.


Implementation details for filesystems
--------------------------------------

Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry whether the underlying devices need any explicit cache flushing and how
Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.
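
As a sketch of what this means for a filesystem (the function name is made
up, the buffer head is assumed to be mapped and filled in already, and this
is not taken from any particular filesystem), a journal commit block could
be written out like this:

#include <linux/buffer_head.h>
#include <linux/fs.h>

static int example_write_commit_block(struct buffer_head *bh)
{
        lock_buffer(bh);
        clear_buffer_dirty(bh);
        get_bh(bh);             /* reference dropped by end_buffer_write_sync() */
        bh->b_end_io = end_buffer_write_sync;

        /*
         * REQ_FLUSH: everything that completed before this write is on
         * non-volatile storage before the commit block is written.
         * REQ_FUA: the commit block itself is on non-volatile storage
         * before completion is signaled.
         */
        submit_bh(WRITE | REQ_FLUSH | REQ_FUA, bh);

        wait_on_buffer(bh);
        return buffer_uptodate(bh) ? 0 : -EIO;
}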


Implementation details for make_request_fn based block drivers
--------------------------------------------------------------

These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bits need to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.
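
For a simple make_request_fn driver whose device has no volatile cache, the
rules above boil down to something like the following sketch
(example_do_transfer() is a stand-in for the driver's real data path, not an
existing interface):

static int example_make_request(struct request_queue *q, struct bio *bio)
{
        /* empty REQ_FLUSH bio: nothing to flush, complete it right away */
        if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
                bio_endio(bio, 0);
                return 0;
        }

        /*
         * No volatile cache, so REQ_FLUSH/REQ_FUA on bios carrying data
         * need no special handling; just perform the transfer.
         */
        example_do_transfer(q->queuedata, bio);
        bio_endio(bio, 0);
        return 0;
}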


Implementation details for request_fn based block drivers
----------------------------------------------------------

For devices that do not support volatile write caches there is no driver
support required, the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);

and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of empty REQ_FLUSH and the actual write by the block layer. For devices
that also support the FUA bit the block layer needs to be told to pass
through that bit using:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);

and handle write requests that have the REQ_FUA bit set properly in its
prep_fn/request_fn. If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.
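
A minimal request_fn sketch for a device with a volatile write cache might
look like the following (example_hw_flush_cache() and example_hw_transfer()
are placeholders for the driver's real hardware access, and a real driver
would normally complete requests asynchronously from its interrupt handler
rather than synchronously like this):

static void example_request_fn(struct request_queue *q)
{
        struct request *rq;

        while ((rq = blk_fetch_request(q)) != NULL) {
                if ((rq->cmd_flags & REQ_FLUSH) && !blk_rq_bytes(rq)) {
                        /* empty flush generated by the block layer */
                        int error = example_hw_flush_cache(q->queuedata);

                        __blk_end_request_all(rq, error);
                        continue;
                }

                /* regular read/write; honour REQ_FUA if it was advertised */
                __blk_end_request_all(rq,
                        example_hw_transfer(q->queuedata, rq,
                                            rq->cmd_flags & REQ_FUA));
        }
}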

2010-08-23 14:08:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
> The problem purely exists on arrays that report write back cache enabled
> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
> they purely urban legend?

I haven't seen it. I don't care particularly about this case, but once
in a while people want to disable flushing for testing or because they
really don't care.

What about adding a sysfs attribute to every request_queue that allows
disabling the cache flushing feature? Compared to the barrier option
this controls the feature at the right level and makes it available
to everyone instead of being duplicated. After a while we can then
simply ignore the barrier/nobarrier options.

2010-08-23 14:12:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/23/2010 04:08 PM, Christoph Hellwig wrote:
> On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
>> The problem purely exists on arrays that report write back cache enabled
>> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
>> they purely urban legend?
>
> I haven't seen it. I don't care particularly about this case, but once
> it a while people want to disable flushing for testing or because they
> really don't care.
>
> What about adding a sysfs attribue to every request_queue that allows
> disabling the cache flushing feature? Compared to the barrier option
> this controls the feature at the right level and makes it available
> to everyone instead of beeing duplicated. After a while we can then
> simply ignore the barrier/nobarrier options.

Yeah, that sounds reasonable. blk_queue_flush() can be called anytime
without locking anyway, so it should be really easy to implement too.

Thanks.

--
tejun

2010-08-23 14:15:40

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH] block: simplify queue_next_fseq

We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
so take these calls into common code. Also move the end_io initialization
from queue_flush into queue_next_fseq and rename queue_flush to
init_flush_request now that it's old name doesn't apply anymore.

Signed-off-by: Christoph Hellwig <[email protected]>

Index: linux-2.6/block/blk-flush.c
===================================================================
--- linux-2.6.orig/block/blk-flush.c 2010-08-17 15:34:27.864004351 +0200
+++ linux-2.6/block/blk-flush.c 2010-08-17 16:12:53.504253827 +0200
@@ -74,16 +74,11 @@ static void post_flush_end_io(struct req
blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
}

-static void queue_flush(struct request_queue *q, struct request *rq,
- rq_end_io_fn *end_io)
+static void init_flush_request(struct request *rq, struct gendisk *disk)
{
- blk_rq_init(q, rq);
rq->cmd_type = REQ_TYPE_FS;
rq->cmd_flags = REQ_FLUSH;
- rq->rq_disk = q->orig_flush_rq->rq_disk;
- rq->end_io = end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+ rq->rq_disk = disk;
}

static struct request *queue_next_fseq(struct request_queue *q)
@@ -91,29 +86,28 @@ static struct request *queue_next_fseq(s
struct request *orig_rq = q->orig_flush_rq;
struct request *rq = &q->flush_rq;

+ blk_rq_init(q, rq);
+
switch (blk_flush_cur_seq(q)) {
case QUEUE_FSEQ_PREFLUSH:
- queue_flush(q, rq, pre_flush_end_io);
+ init_flush_request(rq, orig_rq->rq_disk);
+ rq->end_io = pre_flush_end_io;
break;
-
case QUEUE_FSEQ_DATA:
- /* initialize proxy request, inherit FLUSH/FUA and queue it */
- blk_rq_init(q, rq);
init_request_from_bio(rq, orig_rq->bio);
rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
rq->end_io = flush_data_end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
break;
-
case QUEUE_FSEQ_POSTFLUSH:
- queue_flush(q, rq, post_flush_end_io);
+ init_flush_request(rq, orig_rq->rq_disk);
+ rq->end_io = post_flush_end_io;
break;
-
default:
BUG();
}
+
+ elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
return rq;
}

2010-08-23 14:17:57

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Mon, Aug 23 2010 at 8:14am -0400,
Tejun Heo <[email protected]> wrote:

> Hello,
>
> On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
> > I think that's correct and changing the priority of DM_ENDIO_REQUEUE
> > for REQ_FLUSH down to the lowest should be fine.
> > (I didn't know that FLUSH failure implies data loss possibility.)
>
> At least on ATA, FLUSH failure implies that data is already lost, so
> the error can't be ignored or retried.
>
> > But the patch is not enough, you have to change target drivers, too.
> > E.g. As for multipath, you need to change
> > drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
> > like the REQ_DISCARD support included in 2.6.36-rc1.
>
> I'll take a look but is there an easy to test mpath other than having
> fancy hardware?

It is easy enough to run mpath on top of a single path. Just verify/modify
/etc/multipath.conf so that your device isn't blacklisted.

multipathd will even work with a scsi-debug device.

You obviously won't get path failover but you'll see the path get marked
faulty, etc.
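
For example (illustrative only, the device name below is just whatever your
test disk happens to be called), a minimal /etc/multipath.conf that
blacklists everything except one local disk could look like:

blacklist {
        devnode ".*"
}

blacklist_exceptions {
        devnode "^sdb$"
}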

> > By the way, if these patch-set with the change above are included,
> > even one path failure for REQ_FLUSH on multipath configuration will
> > be reported to upper layer as error, although it's retried using
> > other paths currently.
> > Then, if an upper layer won't take correct recovery action for the error,
> > it would be seen as a regression for users. (e.g. Frequent EXT3-error
> > resulting in read-only mount on multipath configuration.)
> >
> > Although I think the explicit error is fine rather than implicit data
> > corruption, please check upper layers carefully so that users won't see
> > such errors as much as possible.
>
> Argh... then it will have to discern why FLUSH failed. It can retry
> for transport errors but if it got aborted by the device it should
> report upwards.

Yes, we discussed this issue of needing to train dm-multipath to know if
there was a transport failure or not (at LSF). But I'm not sure when
Hannes intends to repost his work in this area (updated to account for
feedback from LSF).

> Maybe just turn off barrier support in mpath for now?

I think we'd prefer to have a device fail rather than jeopardize data
integrity. Clearly not ideal but...

2010-08-23 14:20:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Mon, Aug 23, 2010 at 04:13:36PM +0200, Tejun Heo wrote:
> Yeah, that sounds reasonable. blk_queue_flush() can be called anytime
> without locking anyway, so it should be really easy to implement too.

I don't think we can simply call blk_queue_flush - we must make sure
never to set more bits than the device allows. We'll just need two
sets of flags in the request queue, with the sysfs file checking that
it never allows more flags than the driver passed to blk_queue_flush.

I'll prepare a patch for this on top of the current series.
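
Just to sketch the idea (flush_capable is a made-up field name here, not
necessarily what the actual patch will use): the store side of such a sysfs
attribute would only ever narrow what the driver declared via
blk_queue_flush():

static ssize_t queue_flush_store(struct request_queue *q, const char *page,
                                 size_t count)
{
        /* '0' disables cache flushing, anything else re-enables it */
        bool enable = (page[0] != '0');

        spin_lock_irq(q->queue_lock);
        if (enable)
                q->flush_flags = q->flush_capable;  /* what blk_queue_flush() set */
        else
                q->flush_flags = 0;                 /* ignore REQ_FLUSH/REQ_FUA */
        spin_unlock_irq(q->queue_lock);

        return count;
}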

2010-08-23 15:19:26

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 08/23/2010 10:01 AM, Jens Axboe wrote:
> On 2010-08-23 15:58, Ric Wheeler wrote:
>> On 08/23/2010 08:48 AM, Christoph Hellwig wrote:
>>> On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
>>>> It might be useful to give several example configurations with
>>>> different cache configurations. I don't have much experience with
>>>> battery backed arrays but aren't they suppose to report write through
>>>> cache automatically?
>>>
>>> They usually do. I have one that doesn't, but SYNCHRONIZE CACHE on
>>> it is so fast that it effectively must be a no-op.
>>>
>>
>> Arrays are not a problem in general - they normally have internally, redundant
>> batteries to hold up the cache.
>>
>> The issue is when you have an internal hardware RAID card with a large cache.
>> Those cards sit in your server and the batteries on the card protect its
>> internal cache, but do not have the capacity to hold up the drives behind it.
>>
>> Normally, those drives should have their write cache disabled, but sometimes
>> (especially with S-ATA disks) this is not done.
>
> The problem purely exists on arrays that report write back cache enabled
> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
> they purely urban legend?
>


Hi Jens,

There are actually two distinct problems:

(1) arrays with a non-volatile write cache (battery backed, nvram, whatever)
that do not NOOP a SYNC_CACHE command. I know of one brand that seems to do
this, but it is not a common brand. If we do not issue flushes for write through
caches, I think that we will avoid this in any case.

(2) hardware raid cards with internal buffer memory and on-card battery backup
(they sit in your server, disks sit in jbod like expansion shelves). These are
fine if the drives in those shelves have write cache disabled.

ric

2010-08-23 16:49:31

by John Robinson

[permalink] [raw]
Subject: OT grammar nit Re: [PATCH] block: simplify queue_next_fseq

On 23/08/2010 15:15, Christoph Hellwig wrote:
> We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
> so take these calls into common code. Also move the end_io initialization
> from queue_flush into queue_next_fseq and rename queue_flush to
> init_flush_request now that it's old name doesn't apply anymore.

Nit: it's "its" above, not "it's". If in doubt, if it's "it is" (or "it
has") it's "it's" but if it could be "his" or "hers" it's "its".

I'm guessing English isn't your first language (a) because of your .de
address and (b) because it's better than most British people's, but
still, it's a common mistake. If I can remember any of the German I
studied all those years ago, "its" is roughly equivalent to "sein", and
"it's" to "es ist".

Cheers,

John.

2010-08-23 16:50:00

by Ric Wheeler

[permalink] [raw]
Subject: Re: [dm-devel] [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 08/23/2010 12:45 PM, Sergey Vlasov wrote:
> On Mon, Aug 23, 2010 at 11:19:13AM -0400, Ric Wheeler wrote:
> [...]
>> (2) hardware raid cards with internal buffer memory and on-card battery backup
>> (they sit in your server, disks sit in jbod like expansion shelves). These are
>> fine if the drives in those shelves have write cache disabled.
>
> Actually some of such cards keep write cache on the drives enabled and
> issue FLUSH CACHE commands to the drives. E.g., 3ware 9690SA behaves
> like this at least with SATA drives (the FLUSH CACHE commands can be
> seen after enabling performance monitoring - they often end up in the
> "10 commands having the largest latency" table). This can actually be
> safe if the card waits for the FLUSH CACHE completion before making
> the write cache data in its battery-backed memory available for reuse
> (and the drive implements the FLUSH CACHE command correctly).

Yes - this is certainly one way to do it. Note that this will not work if the
card advertises itself as a write through cache (and we end up not sending down
the SYNC_CACHE commands).

At least one hardware RAID card (I unfortunately cannot mention the brand) did
not do this command forwarding.

ric

2010-08-23 16:55:46

by Sergey Vlasov

[permalink] [raw]
Subject: Re: [dm-devel] [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Mon, Aug 23, 2010 at 11:19:13AM -0400, Ric Wheeler wrote:
[...]
> (2) hardware raid cards with internal buffer memory and on-card battery backup
> (they sit in your server, disks sit in jbod like expansion shelves). These are
> fine if the drives in those shelves have write cache disabled.

Actually some of such cards keep write cache on the drives enabled and
issue FLUSH CACHE commands to the drives. E.g., 3ware 9690SA behaves
like this at least with SATA drives (the FLUSH CACHE commands can be
seen after enabling performance monitoring - they often end up in the
"10 commands having the largest latency" table). This can actually be
safe if the card waits for the FLUSH CACHE completion before making
the write cache data in its battery-backed memory available for reuse
(and the drive implements the FLUSH CACHE command correctly).



2010-08-24 10:25:51

by Kiyoshi Ueda

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hi Tejun,

On 08/23/2010 11:17 PM +0900, Mike Snitzer wrote:
> On Mon, Aug 23 2010 at 8:14am -0400, Tejun Heo <[email protected]> wrote:
>> On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
>>> By the way, if these patch-set with the change above are included,
>>> even one path failure for REQ_FLUSH on multipath configuration will
>>> be reported to upper layer as error, although it's retried using
>>> other paths currently.
>>> Then, if an upper layer won't take correct recovery action for the error,
>>> it would be seen as a regression for users. (e.g. Frequent EXT3-error
>>> resulting in read-only mount on multipath configuration.)
>>>
>>> Although I think the explicit error is fine rather than implicit data
>>> corruption, please check upper layers carefully so that users won't see
>>> such errors as much as possible.
>>
>> Argh... then it will have to discern why FLUSH failed. It can retry
>> for transport errors but if it got aborted by the device it should
>> report upwards.
>
> Yes, we discussed this issue of needing to train dm-multipath to know if
> there was a transport failure or not (at LSF). But I'm not sure when
> Hannes intends to repost his work in this area (updated to account for
> feedback from LSF).

Yes, checking whether it's a transport error in lower layer is
the right solution.
(Since I know it's not available yet, I just hoped upper layers
might have some other options.)

Anyway, only reporting errors for REQ_FLUSH to upper layer without
such a solution would make dm-multipath almost unusable in the real world,
although it's better than implicit data loss.


>> Maybe just turn off barrier support in mpath for now?

If it's possible, it could be a workaround for a short term.
But how can you do that?

I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
Underlying devices of a mpath device may have write-back cache and
it may be enabled.
So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
becomes a device which has write-back cache but doesn't support flush.
Then, upper layer can do nothing to ensure cache flush?

Thanks,
Kiyoshi Ueda

2010-08-24 17:05:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> Yes, checking whether it's a transport error in lower layer is
> the right solution.
> (Since I know it's not available yet, I just hoped if upper layers
> had some other options.)
>
> Anyway, only reporting errors for REQ_FLUSH to upper layer without
> such a solution would make dm-multipath almost unusable in real world,
> although it's better than implicit data loss.

I see.

>>> Maybe just turn off barrier support in mpath for now?
>
> If it's possible, it could be a workaround for a short term.
> But how can you do that?
>
> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> Underlying devices of a mpath device may have write-back cache and
> it may be enabled.
> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> becomes a device which has write-back cache but doesn't support flush.
> Then, upper layer can do nothing to ensure cache flush?

Yeah, I was basically suggesting to forget about cache flush w/ mpath
until it can be fixed. You're saying that if mpath just passes
REQ_FLUSH upwards without retrying, it will be almost unusable,
right? I'm not sure how to proceed here. How much work would
discerning between transport and IO errors take? If it can't be done
quickly enough the retry logic can be kept around to keep the old
behavior but that already was a broken behavior, so... :-(

Thanks.

--
tejun

2010-08-24 17:12:16

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Tejun Heo, on 08/23/2010 04:14 PM wrote:
>> I think that's correct and changing the priority of DM_ENDIO_REQUEUE
>> for REQ_FLUSH down to the lowest should be fine.
>> (I didn't know that FLUSH failure implies data loss possibility.)
>
> At least on ATA, FLUSH failure implies that data is already lost, so
> the error can't be ignored or retried.

In SCSI there are conditions under which a command, including FLUSH
(SYNC_CACHE), can fail without implying lost data. In those cases the
caller is expected to retry the failed command. The most common cases
are Unit Attentions and TASK QUEUE FULL status.

Vlad

2010-08-24 17:52:48

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Tue, Aug 24 2010 at 12:59pm -0400,
Tejun Heo <[email protected]> wrote:

> Hello,
>
> On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> > Yes, checking whether it's a transport error in lower layer is
> > the right solution.
> > (Since I know it's not available yet, I just hoped if upper layers
> > had some other options.)
> >
> > Anyway, only reporting errors for REQ_FLUSH to upper layer without
> > such a solution would make dm-multipath almost unusable in real world,
> > although it's better than implicit data loss.
>
> I see.
>
> >>> Maybe just turn off barrier support in mpath for now?
> >
> > If it's possible, it could be a workaround for a short term.
> > But how can you do that?
> >
> > I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> > Underlying devices of a mpath device may have write-back cache and
> > it may be enabled.
> > So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> > becomes a device which has write-back cache but doesn't support flush.
> > Then, upper layer can do nothing to ensure cache flush?
>
> Yeah, I was basically suggesting to forget about cache flush w/ mpath
> until it can be fixed. You're saying that if mpath just passes
> REQ_FLUSH upwards without retrying, it will be almost unuseable,
> right? I'm not sure how to proceed here.

Seems clear that we must fix mpath to receive the SCSI errors, in some
form, so it can decide if a retry is required/valid or not.

Such error processing was a big selling point for the transition from
bio-based to request-based multipath; so it's unfortunate that this
piece has been left until now.

> How much work would discerning between transport and IO errors take?

Hannes already proposed some patches:
https://patchwork.kernel.org/patch/61282/
https://patchwork.kernel.org/patch/61283/
https://patchwork.kernel.org/patch/61596/

This work was discussed at LSF, see "Error Handling - Hannes Reinecke"
here: http://lwn.net/Articles/400589/

I thought James, Alasdair and others offered some guidance on what he'd
like to see...

Unfortunately, even though I was at this LSF session, I can't recall any
specific consensus on how Hannes' work should be refactored (to avoid
adding SCSI sense processing code directly in dm-mpath). Maybe James,
Hannes or others remember?

Was it enough to just have the SCSI sense processing code split out in a
new sub-section of the SCSI midlayer -- and then DM calls that code?

> If it can't be done quickly enough the retry logic can be kept around
> to keep the old behavior but that already was a broken behavior, so...
> :-(

I'll have to review this thread again to understand why mpath's existing
retry logic is broken behavior. mpath is used with more capable SCSI
devices so I'm missing why a failed FLUSH implies data loss.

Mike

2010-08-24 18:13:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hello,

On 08/24/2010 07:52 PM, Mike Snitzer wrote:
>> If it can't be done quickly enough the retry logic can be kept around
>> to keep the old behavior but that already was a broken behavior, so...
>> :-(
>
> I'll have to review this thread again to understand why mpath's existing
> retry logic is broken behavior. mpath is used with more capable SCSI
> devices so I'm missing why a failed FLUSH implies data loss.

SBC doesn't specify the failure behavior, so it could be that retrying
a flush is safe. But for most disk type devices, flush failure
usually indicates that the device exhausted all the options to commit
some of the pending data to NV media - ie. even remapping failed for
whatever reason. Even if retry is safe, it's more likely to simply
delay notification of failure.

In ATA, the situation is clearer: when a device actively fails a
flush, the drive reports the first sector it failed to commit,
and the next flush will continue _after_ that sector - IOW, data is
already lost.

<speculation>
I think there's no reason mpath should be tasked with retrying flush
failure. That's up to the SCSI EH. If the command failed in a 'safe'
transient way - ie. device busy or whatnot, SCSI EH can and does retry
the command. There are several FAILFAST bits already and SCSI EH can
avoid retrying transport errors for mpath (maybe it already does
that?) and would just need to be able to tell the upper layer that the
failure was a fast one and that the upper layer is responsible for
retrying. Is there
any reason to pass the whole sense information upwards?
</speculation>

Anyways, flush failure is different from read/write failures.
Read/writes can always be retried cleanly. They are stateless. I
don't know how SCSI devices would actually behave but it's a bit
scary to retry a SYNCHRONIZE_CACHE that a device failed and report success
upwards.

Thanks.

--
tejun

2010-08-24 23:39:00

by Alan

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

> In SCSI there are conditions when a command, including FLUSH
> (SYNC_CACHE), failed which don't imply lost data. For them the caller
> expected to retry the failed command. Most common cases are Unit
> Attentions and TASK QUEUE FULL status.

ATA expects the command to be retried as well because a failed flush
indicates the specific sector is lost (unless the host still has a copy
of course - which is *very* likely although we don't use it) but the rest
of the flush transaction can be retried to continue to flush sectors
beyond the failed one.

Alan

2010-08-25 08:02:55

by Kiyoshi Ueda

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hi Tejun,

On 08/25/2010 01:59 AM +0900, Tejun Heo wrote:
> On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
>> Anyway, only reporting errors for REQ_FLUSH to upper layer without
>> such a solution would make dm-multipath almost unusable in real world,
>> although it's better than implicit data loss.
>
> I see.
>
>>> Maybe just turn off barrier support in mpath for now?
>>
>> If it's possible, it could be a workaround for a short term.
>> But how can you do that?
>>
>> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
>> Underlying devices of a mpath device may have write-back cache and
>> it may be enabled.
>> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
>> becomes a device which has write-back cache but doesn't support flush.
>> Then, upper layer can do nothing to ensure cache flush?
>
> Yeah, I was basically suggesting to forget about cache flush w/ mpath
> until it can be fixed. You're saying that if mpath just passes
> REQ_FLUSH upwards without retrying, it will be almost unuseable,
> right?

Right.
If the error is safe/needed to retry using other paths, mpath should
retry even if REQ_FLUSH. Otherwise, only one path failure may result
in system down.
Just passing any REQ_FLUSH error upwards regardless of the error type
will create such situations, and users will see the behavior as
unstable/unusable.


> I'm not sure how to proceed here. How much work would
> discerning between transport and IO errors take? If it can't be done
> quickly enough the retry logic can be kept around to keep the old
> behavior but that already was a broken behavior, so... :-(

I'm not sure how long it will take.
Anyway, as you said, the flush error handling of dm-mpath is already
broken if data loss really happens on any storage used by dm-mpath.
Although it's a serious issue and quick fix is required, I think
you may leave the old behavior in your patch-set, since it's
a separate issue.

Thanks,
Kiyoshi Ueda

2010-08-25 11:31:20

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On 2010-08-23 16:08, Christoph Hellwig wrote:
> On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
>> The problem purely exists on arrays that report write back cache enabled
>> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
>> they purely urban legend?
>
> I haven't seen it. I don't care particularly about this case, but once
> it a while people want to disable flushing for testing or because they
> really don't care.
>
> What about adding a sysfs attribue to every request_queue that allows
> disabling the cache flushing feature? Compared to the barrier option
> this controls the feature at the right level and makes it available
> to everyone instead of beeing duplicated. After a while we can then
> simply ignore the barrier/nobarrier options.

Agree, that would be fine.

--
Jens Axboe

2010-08-25 15:29:08

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Wed, Aug 25 2010 at 4:00am -0400,
Kiyoshi Ueda <[email protected]> wrote:

> Hi Tejun,
>
> On 08/25/2010 01:59 AM +0900, Tejun Heo wrote:
> > On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> >> Anyway, only reporting errors for REQ_FLUSH to upper layer without
> >> such a solution would make dm-multipath almost unusable in real world,
> >> although it's better than implicit data loss.
> >
> > I see.
> >
> >>> Maybe just turn off barrier support in mpath for now?
> >>
> >> If it's possible, it could be a workaround for a short term.
> >> But how can you do that?
> >>
> >> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> >> Underlying devices of a mpath device may have write-back cache and
> >> it may be enabled.
> >> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> >> becomes a device which has write-back cache but doesn't support flush.
> >> Then, upper layer can do nothing to ensure cache flush?
> >
> > Yeah, I was basically suggesting to forget about cache flush w/ mpath
> > until it can be fixed. You're saying that if mpath just passes
> > REQ_FLUSH upwards without retrying, it will be almost unuseable,
> > right?
>
> Right.
> If the error is safe/needed to retry using other paths, mpath should
> retry even if REQ_FLUSH. Otherwise, only one path failure may result
> in system down.
> Just passing any REQ_FLUSH error upwards regardless the error type
> will make such situations, and users will feel the behavior as
> unstable/unusable.

Right, there are hardware configurations that lend themselves to FLUSH
retries mattering, namely:
1) a SAS drive with 2 ports and a writeback cache
2) theoretically possible: SCSI array that is mpath capable but
advertises cache as writeback (WCE=1)

The SAS case is obviously a more concrete example of why FLUSH retries
are worthwhile in mpath.

But I understand (and agree) that we'd be better off if mpath could
differentiate between failures rather than blindly retrying on failures
like it does today (fails path and retries if additional paths
available).

> Anyway, as you said, the flush error handling of dm-mpath is already
> broken if data loss really happens on any storage used by dm-mpath.
> Although it's a serious issue and quick fix is required, I think
> you may leave the old behavior in your patch-set, since it's
> a separate issue.

I'm not seeing where anything is broken with current mpath. If a
multipathed LUN is WCE=1 then it should be fair to assume the cache is
mirrored or shared across ports. Therefore retrying the SYNCHRONIZE
CACHE is needed.

Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
Seems unlikely especially given what Tejun shared from SBC.

It seems that at worst, with current mpath, we retry when it doesn't
make sense (e.g. target failure).

Mike

2010-08-25 15:59:54

by Mike Snitzer

[permalink] [raw]
Subject: [RFC] training mpath to discern between SCSI errors (was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush)

On Wed, Aug 25 2010 at 4:00am -0400,
Kiyoshi Ueda <[email protected]> wrote:

> > I'm not sure how to proceed here. How much work would
> > discerning between transport and IO errors take? If it can't be done
> > quickly enough the retry logic can be kept around to keep the old
> > behavior but that already was a broken behavior, so... :-(
>
> I'm not sure how long will it take.

We first need to understand what direction we want to go with this. We
currently have 2 options. But any other ideas are obviously welcome.

1)
Mike Christie has a patchset that introduces more specific
target/transport/host error codes. Mike shared these pointers but he'd
have to put the work in to refresh them:
http://marc.info/?l=linux-scsi&m=112487427230642&w=2
http://marc.info/?l=linux-scsi&m=112487427306501&w=2
http://marc.info/?l=linux-scsi&m=112487431524436&w=2
http://marc.info/?l=linux-scsi&m=112487431524350&w=2

errno.h new EXYZ
http://marc.info/?l=linux-kernel&m=107715299008231&w=2

add block layer blkdev.h error values
http://marc.info/?l=linux-kernel&m=107961883915068&w=2

add block layer blkdev.h error values (v2 convert more drivers)
http://marc.info/?l=linux-scsi&m=112487427230642&w=2

I think that patchset's approach is fairly disruptive just to be able to
train upper layers to differentiate (e.g. mpath). But in the end maybe
that change takes the code in a more desirable direction?

2)
Another option is Hannes' approach of having DM consume req->errors and
SCSI sense more directly.

I've refreshed Hannes' previous patchset against 2.6.36-rc2 but I
haven't finished testing it yet (should be OK.. it boots, but I still have
a FIXME to move scsi_uld_should_retry to scsi_error.c):
http://people.redhat.com/msnitzer/patches/dm-scsi-sense/

Would be great if James, Hannes and others had a look at this
refreshed RFC patchset. It's clearly not polished but it gives an idea
of the approach. Does this look worthwhile?

Follow-on work is needed to refine scsi_uld_should_retry further. Keep
in mind that scsi_error.c is the intended location for this code.

James, please note that I've attempted to make REQ_TYPE_FS set
req->errors only for "genuine errors" by (ab)using
scsi_decide_disposition:
http://people.redhat.com/msnitzer/patches/dm-scsi-sense/scsi-Always-pass-error-result-and-sense-on-request-completion.patch

If others think this may be worthwhile I can finish testing, cleanup the
patches further, and post them.

Mike

2010-08-25 19:14:32

by Mike Christie

[permalink] [raw]
Subject: Re: [RFC] training mpath to discern between SCSI errors

On 08/25/2010 10:59 AM, Mike Snitzer wrote:
> On Wed, Aug 25 2010 at 4:00am -0400,
> Kiyoshi Ueda<[email protected]> wrote:
>
>>> I'm not sure how to proceed here. How much work would
>>> discerning between transport and IO errors take? If it can't be done
>>> quickly enough the retry logic can be kept around to keep the old
>>> behavior but that already was a broken behavior, so... :-(
>>
>> I'm not sure how long will it take.
>
> We first need to understand what direction we want to go with this. We
> currently have 2 options. But any other ideas are obviously welcome.
>
> 1)
> Mike Christie has a patchset that introduce more specific
> target/transport/host error codes. Mike shared these pointers but he'd
> have to put the work in to refresh them:
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
> http://marc.info/?l=linux-scsi&m=112487427306501&w=2
> http://marc.info/?l=linux-scsi&m=112487431524436&w=2
> http://marc.info/?l=linux-scsi&m=112487431524350&w=2
>
> errno.h new EXYZ
> http://marc.info/?l=linux-kernel&m=107715299008231&w=2
>
> add block layer blkdev.h error values
> http://marc.info/?l=linux-kernel&m=107961883915068&w=2
>
> add block layer blkdev.h error values (v2 convert more drivers)
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
>
> I think that patchset's appoach is fairly disruptive just to be able to
> train upper layers to differentiate (e.g. mpath). But in the end maybe
> that change takes the code in a more desirable direction?

I think it is more disruptive, but is the cleaner approach in the end.

#2 looks hacky. In upper layers, we will have checks for dasd, AOE
and other drivers. And then #2 does not even work for filesystems
(ext said they need this).



>
> 2)
> Another option is Hannes' approach of having DM consume req->errors and
> SCSI sense more directly.
>

2010-08-27 09:48:49

by Kiyoshi Ueda

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hi Mike,

On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <[email protected]> wrote:
>> Anyway, as you said, the flush error handling of dm-mpath is already
>> broken if data loss really happens on any storage used by dm-mpath.
>> Although it's a serious issue and quick fix is required, I think
>> you may leave the old behavior in your patch-set, since it's
>> a separate issue.
>
> I'm not seeing where anything is broken with current mpath. If a
> multipathed LUN is WCE=1 then it should be fair to assume the cache is
> mirrored or shared across ports. Therefore retrying the SYNCHRONIZE
> CACHE is needed.
>
> Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
> Seems unlikely especially given what Tejun shared from SBC.

Do we have any proof to wipe that fear?

If retrying on flush failure is safe on all storages used with multipath
(e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
the real world.
But I'm afraid if there is a storage where something like below can happen:
- a flush command is returned as error to mpath because a part of
cache has physically broken at the time or so, then that part of
data loses and the size of the cache is shrunk by the storage.
- mpath retries the flush command using other path.
- the flush command is returned as success to mpath.
- mpath passes the result, success, to upper layer, but some of
the data already lost.

Thanks,
Kiyoshi Ueda

2010-08-27 13:50:08

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

On Fri, Aug 27 2010 at 5:47am -0400,
Kiyoshi Ueda <[email protected]> wrote:

> Hi Mike,
>
> On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
> > Kiyoshi Ueda <[email protected]> wrote:
> >> Anyway, as you said, the flush error handling of dm-mpath is already
> >> broken if data loss really happens on any storage used by dm-mpath.
> >> Although it's a serious issue and quick fix is required, I think
> >> you may leave the old behavior in your patch-set, since it's
> >> a separate issue.
> >
> > I'm not seeing where anything is broken with current mpath. If a
> > multipathed LUN is WCE=1 then it should be fair to assume the cache is
> > mirrored or shared across ports. Therefore retrying the SYNCHRONIZE
> > CACHE is needed.
> >
> > Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
> > Seems unlikely especially given what Tejun shared from SBC.
>
> Do we have any proof to wipe that fear?
>
> If retrying on flush failure is safe on all storages used with multipath
> (e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
> the real world.
> But I'm afraid if there is a storage where something like below can happen:
> - a flush command is returned as error to mpath because a part of
> cache has physically broken at the time or so, then that part of
> data loses and the size of the cache is shrunk by the storage.
> - mpath retries the flush command using other path.
> - the flush command is returned as success to mpath.
> - mpath passes the result, success, to upper layer, but some of
> the data already lost.

That does seem like a valid concern. But I'm not seeing why it's unique
to SYNCHRONIZE CACHE. Any IO that fails on the target side should be
passed up once the error gets to DM.

Mike

2010-08-30 06:15:15

by Kiyoshi Ueda

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hi Mike,

On 08/27/2010 10:49 PM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <[email protected]> wrote:
>> On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
>>> Kiyoshi Ueda <[email protected]> wrote:
>>>> Anyway, as you said, the flush error handling of dm-mpath is already
>>>> broken if data loss really happens on any storage used by dm-mpath.
>>>> Although it's a serious issue and quick fix is required, I think
>>>> you may leave the old behavior in your patch-set, since it's
>>>> a separate issue.
>>>
>>> I'm not seeing where anything is broken with current mpath. If a
>>> multipathed LUN is WCE=1 then it should be fair to assume the cache is
>>> mirrored or shared across ports. Therefore retrying the SYNCHRONIZE
>>> CACHE is needed.
>>>
>>> Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
>>> Seems unlikely especially given what Tejun shared from SBC.
>>
>> Do we have any proof to dispel that fear?
>>
>> If retrying on flush failure is safe on all storage used with multipath
>> (e.g. SCSI, CCISS, DASD, etc.), then current dm-mpath should be fine in
>> the real world.
>> But I'm afraid there may be storage where something like the following can happen:
>> - a flush command is returned to mpath as an error because part of the
>> cache physically broke at that moment, so that part of the data is
>> lost and the storage shrinks the size of its cache.
>> - mpath retries the flush command on another path.
>> - the flush command is returned to mpath as success.
>> - mpath passes the result, success, to the upper layer, even though
>> some of the data has already been lost.
>
> That does seem like a valid concern. But I'm not seeing why it's unique
> to SYNCHRONIZE CACHE. Any I/O that fails on the target side should be
> passed up once the error gets to DM.

See Tejun's explanation again:
http://marc.info/?l=linux-kernel&m=128267361813859&w=2
What I'm concerned about is whether the same thing Tejun explained
for ATA can happen on other types of devices.


A normal write command carries its data, so no data is lost on error.
It can be retried cleanly, and if the retry succeeds, it really is a
success, with no implicit data loss.

A normal read command targets a specific sector. If the sector is broken,
all retries will fail and the error will be reported upwards.
So it can be retried cleanly as well.

Thanks,
Kiyoshi Ueda

2010-08-30 09:54:18

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Tejun Heo wrote:
> Hello,
>
> On 08/18/2010 09:30 PM, Vladislav Bolkhovitin wrote:
>> Basically, I measured how iSCSI link utilization depends from amount
>> of queued commands and queued data size. This is why I made it as a
>> table. From it you can see which improvement you will have removing
>> queue draining after 1, 2, 4, etc. commands depending of commands
>> sizes.
>>
>> For instance, on my previous XFS rm example, where rm of 4 files
>> took 3.5 minutes with nobarrier option, I could see that XFS was
>> sending 1-3 32K commands in a row. From my table you can see that if
>> it sent all them at once without draining, it would have about
>> 150-200% speed increase.
>
> You compared barrier off/on. Of course, it will make a big
> difference. I think good part of that gain should be realized by the
> currently proposed patchset which removes draining. What's needed to
> be demonstrated is the difference between ordered-by-waiting and
> ordered-by-tag. We've never had code to do that properly.
>
> The original ordered-by-tag we had only applied tag ordering to two or
> three command sequences inside a barrier, which doesn't amount to much
> (and could even be harmful as it imposes draining of all simple
> commands inside the device only to reduce issue latencies for a few
> commands). You'll need to hook into filesystem and somehow export the
> ordering information down to the driver so that whatever needs
> ordering is sent out as ordered commands.
>
> As I've wrote multiple times, I'm pretty skeptical it will bring much.
> Ordered tag mandates draining inside the device just like the original
> barrier implementation. Sure, it's done at a lower layer and command
> issue latencies will be reduced thanks to that but ordered-by-waiting
> doesn't require _any_ draining at all. The whole pipeline can be kept
> full all the time. I'm often wrong tho, so please feel free to go
> ahead and prove me wrong. :-)
>
Actually, I thought about ordered tag writes, too.
But eventually I had to give up on this for a simple reason:
Ordered tag controls the ordering on the SCSI _TARGET_. But for a
meaningful implementation we need to control the ordering all the way
down from ->queuecommand(). Which means we have three areas we need
to cover here:
- driver (ie between ->queuecommand() and passing it off to the firmware)
- firmware
- fabric

Sadly, the latter two are really hard to influence. And, what's more,
with the new/modern CNAs with multiple queues and possible multiple
routes to the target it becomes impossible to guarantee ordering.
So using ordered tags for FibreChannel is not going to work, which
makes implementing it a bit of a pointless exercise for me.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

2010-08-30 10:05:01

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Christoph Hellwig wrote:
> On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
>> The problem purely exists on arrays that report write back cache enabled
>> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
>> they purely urban legend?
>
> I haven't seen it. I don't care particularly about this case, but once
> it a while people want to disable flushing for testing or because they
> really don't care.
>
aacraid for one falls into this category.
SYNC_CACHE is no-oped in the driver. Otherwise you get a _HUGE_
performance loss.
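
For illustration only, no-oping it in a driver boils down to something
like the following (hypothetical names, not the actual aacraid code;
assumes the 2.6.3x-style two-argument ->queuecommand()):

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>

static int example_issue_to_firmware(struct scsi_cmnd *cmd,
                                     void (*done)(struct scsi_cmnd *));

/* Complete SYNCHRONIZE CACHE immediately instead of sending it on. */
static int example_queuecommand(struct scsi_cmnd *cmd,
                                void (*done)(struct scsi_cmnd *))
{
        if (cmd->cmnd[0] == SYNCHRONIZE_CACHE) {
                cmd->result = DID_OK << 16;     /* pretend it succeeded */
                done(cmd);
                return 0;
        }

        /* everything else really goes to the hardware */
        return example_issue_to_firmware(cmd, done);
}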

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

2010-08-30 11:39:00

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [RFC] training mpath to discern between SCSI errors

Mike Snitzer wrote:
> On Wed, Aug 25 2010 at 4:00am -0400,
> Kiyoshi Ueda <[email protected]> wrote:
>
>>> I'm not sure how to proceed here. How much work would
>>> discerning between transport and IO errors take? If it can't be done
>>> quickly enough the retry logic can be kept around to keep the old
>>> behavior but that already was a broken behavior, so... :-(
>> I'm not sure how long will it take.
>
> We first need to understand what direction we want to go with this. We
> currently have 2 options. But any other ideas are obviously welcome.
>
> 1)
> Mike Christie has a patchset that introduce more specific
> target/transport/host error codes. Mike shared these pointers but he'd
> have to put the work in to refresh them:
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
> http://marc.info/?l=linux-scsi&m=112487427306501&w=2
> http://marc.info/?l=linux-scsi&m=112487431524436&w=2
> http://marc.info/?l=linux-scsi&m=112487431524350&w=2
>
> errno.h new EXYZ
> http://marc.info/?l=linux-kernel&m=107715299008231&w=2
>
> add block layer blkdev.h error values
> http://marc.info/?l=linux-kernel&m=107961883915068&w=2
>
> add block layer blkdev.h error values (v2 convert more drivers)
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
>
> I think that patchset's appoach is fairly disruptive just to be able to
> train upper layers to differentiate (e.g. mpath). But in the end maybe
> that change takes the code in a more desirable direction?
>
> 2)
> Another option is Hannes' approach of having DM consume req->errors and
> SCSI sense more directly.
>

Actually, I think we have two separate issues here:
1) The need to have more detailed I/O errors even in the fs layer. We
already discussed this at the LSF; the consensus there is to allow
errors other than just 'EIO'.
Instead of Mike's approach I would rather use existing error codes here;
this will make the transition somewhat easier.
Initially I would propose to return 'ENOLINK' for a transport failure,
'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
retryable failure on the target.

2) The need to differentiate the various error conditions at the multipath
layer. Multipath needs to distinguish the three error types specified
in 1).

Mike has been trying to solve 1) and 2) by introducing separate/new error
codes, and I have been trying to solve 2) by parsing the sense codes directly
in multipathing.

Given that the fs people have expressed their desire to know about these
error classes, too, it makes sense to have them exposed to the fs layer.

I'll see if I can come up with a patch.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

2010-08-30 12:08:42

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [RFC] training mpath to discern between SCSI errors

Hello.

Hannes Reinecke wrote:

> Actually, I think we have two separate issues here:
> 1) The need to have more detailed I/O errors even in the fs layer. We
> already discussed this at the LSF; the consensus there is to allow
> errors other than just 'EIO'.
> Instead of Mike's approach I would rather use existing error codes here;
> this will make the transition somewhat easier.
> Initially I would propose to return 'ENOLINK' for a transport failure,
> 'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
> retryable failure on the target.

Are you sure it's not vice versa: EIO for retryable and ENODEV for
non-retryable failures? ENODEV looks more like a permanent condition to me.

WBR, Sergei

2010-08-30 12:39:19

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [RFC] training mpath to discern between SCSI errors

Sergei Shtylyov wrote:
> Hello.
>
> Hannes Reinecke wrote:
>
>> Actually, I think we have two separate issues here:
>> 1) The need to have more detailed I/O errors even in the fs layer. We
>> already discussed this at the LSF; the consensus there is to allow
>> errors other than just 'EIO'.
>> Instead of Mike's approach I would rather use existing error codes
>> here;
>> this will make the transition somewhat easier.
>> Initially I would propose to return 'ENOLINK' for a transport failure,
>> 'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
>> retryable failure on the target.
>
> Are you sure it's not vice versa: EIO for retryable and ENODEV for
> non-retryable failures? ENODEV looks more like a permanent condition to me.
>
Ok, can do.
And looking at the error numbers again, maybe we should be using 'EREMOTEIO'
for non-retryable failures.

So we would end up with:

ENOLINK: transport failure
EIO: retryable remote failure
EREMOTEIO: non-retryable remote failure

Does that look okay?
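
For the consumer side, a rough sketch (invented names, not the actual
dm-mpath code) of how multipath could act on these three classes:

#include <linux/errno.h>

enum sketch_action {
        SKETCH_FAIL_PATH,       /* fail this path, retry the I/O elsewhere */
        SKETCH_PASS_UP,         /* retrying on another path cannot help */
        SKETCH_RETRY,           /* ordinary retryable failure */
};

/* Illustrative mapping of the proposed errno classes to mpath actions. */
static enum sketch_action sketch_classify_error(int error)
{
        switch (error) {
        case -ENOLINK:          /* transport failure */
                return SKETCH_FAIL_PATH;
        case -EREMOTEIO:        /* non-retryable remote failure */
        case -EOPNOTSUPP:       /* already passed up today */
                return SKETCH_PASS_UP;
        case -EIO:              /* retryable remote failure */
        default:
                return SKETCH_RETRY;
        }
}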

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

2010-08-30 14:52:45

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [dm-devel] [RFC] training mpath to discern between SCSI errors

From f0835d92426cb3938f79f1b7a1e1208de63ca7bc Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <[email protected]>
Date: Mon, 30 Aug 2010 16:21:10 +0200
Subject: [RFC][PATCH] scsi: Detailed I/O errors

Instead of just passing 'EIO' for any I/O errors we should be
notifying the upper layers with some more details about the cause
of this error.
This patch updates the possible I/O errors to:

- ENOLINK: Link failure between host and target
- EIO: Retryable I/O error
- EREMOTEIO: Non-retryable I/O error

'Retryable' in this context means that an I/O error _might_ be
restricted to the I_T_L nexus (vulgo: path), so retrying on another
nexus / path might succeed.

Signed-off-by: Hannes Reinecke <[email protected]>

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 487ecda..d49b375 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1270,7 +1270,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
if (!error && !clone->errors)
return 0; /* I/O complete */

- if (error == -EOPNOTSUPP)
+ if (error == -EOPNOTSUPP || error == -EREMOTEIO)
return error;

if (clone->cmd_flags & REQ_DISCARD)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index ce089df..5da040b 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -223,7 +223,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
* @scmd: Cmd to have sense checked.
*
* Return value:
- * SUCCESS or FAILED or NEEDS_RETRY
+ * SUCCESS or FAILED or NEEDS_RETRY or TARGET_ERROR
*
* Notes:
* When a deferred error is detected the current command has
@@ -338,25 +338,25 @@ static int scsi_check_sense(struct scsi_cmnd *scmd)
case COPY_ABORTED:
case VOLUME_OVERFLOW:
case MISCOMPARE:
- return SUCCESS;
+ case DATA_PROTECT:
+ case BLANK_CHECK:
+ return TARGET_ERROR;

case MEDIUM_ERROR:
if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
sshdr.asc == 0x13 || /* AMNF DATA FIELD */
sshdr.asc == 0x14) { /* RECORD NOT FOUND */
- return SUCCESS;
+ return TARGET_ERROR;
}
return NEEDS_RETRY;

case HARDWARE_ERROR:
if (scmd->device->retry_hwerror)
return ADD_TO_MLQUEUE;
- else
- return SUCCESS;
-
+ else {
+ return TARGET_ERROR;
+ }
case ILLEGAL_REQUEST:
- case BLANK_CHECK:
- case DATA_PROTECT:
default:
return SUCCESS;
}
@@ -819,6 +819,7 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
case SUCCESS:
case NEEDS_RETRY:
case FAILED:
+ case TARGET_ERROR:
break;
case ADD_TO_MLQUEUE:
rtn = NEEDS_RETRY;
@@ -1512,6 +1513,12 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
rtn = scsi_check_sense(scmd);
if (rtn == NEEDS_RETRY)
goto maybe_retry;
+ if (rtn == TARGET_ERROR) {
+ /* Need to modify host byte to signal a
+ * permanent target failure */
+ scmd->result |= (DID_TARGET_FAILURE << 16);
+ rtn = SUCCESS;
+ }
/* if rtn == FAILED, we have no sense information;
* returning FAILED will wake the error handler thread
* to collect the sense and redo the decide
@@ -1529,6 +1536,7 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
case RESERVATION_CONFLICT:
SCSI_LOG_ERROR_RECOVERY(3, sdev_printk(KERN_INFO, scmd->device,
"reservation conflict\n"));
+ scmd->result |= (DID_TARGET_FAILURE << 16);
return SUCCESS; /* causes immediate i/o error */
default:
return FAILED;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9ade720..fb841e3 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -736,8 +736,20 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
memcpy(req->sense, cmd->sense_buffer, len);
req->sense_len = len;
}
- if (!sense_deferred)
- error = -EIO;
+ if (!sense_deferred) {
+ switch(host_byte(result)) {
+ case DID_TRANSPORT_FAILFAST:
+ error = -ENOLINK;
+ break;
+ case DID_TARGET_FAILURE:
+ cmd->result |= (DID_OK << 16);
+ error = -EREMOTEIO;
+ break;
+ default:
+ error = -EIO;
+ break;
+ }
+ }
}

req->resid_len = scsi_get_resid(cmd);
@@ -796,7 +808,18 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
if (scsi_end_request(cmd, error, good_bytes, result == 0) == NULL)
return;

- error = -EIO;
+ switch (host_byte(result)) {
+ case DID_TRANSPORT_FAILFAST:
+ error = -ENOLINK;
+ break;
+ case DID_TARGET_FAILURE:
+ cmd->result |= (DID_OK << 16);
+ error = -EREMOTEIO;
+ break;
+ default:
+ error = -EIO;
+ break;
+ }

if (host_byte(result) == DID_RESET) {
/* Third party bus reset or reset for error recovery
@@ -1418,7 +1441,6 @@ static void scsi_softirq_done(struct request *rq)
wait_for/HZ);
disposition = SUCCESS;
}
-
scsi_log_completion(cmd, disposition);

switch (disposition) {
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 8fcb6e0..abfee76 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -397,6 +397,8 @@ static inline int scsi_is_wlun(unsigned int lun)
* recover the link. Transport class will
* retry or fail IO */
#define DID_TRANSPORT_FAILFAST 0x0f /* Transport class fastfailed the io */
+#define DID_TARGET_FAILURE 0x10 /* Permanent target failure, do not retry on
+ * other paths */
#define DRIVER_OK 0x00 /* Driver status */

/*
@@ -426,6 +428,7 @@ static inline int scsi_is_wlun(unsigned int lun)
#define TIMEOUT_ERROR 0x2007
#define SCSI_RETURN_NOT_HANDLED 0x2008
#define FAST_IO_FAIL 0x2009
+#define TARGET_ERROR 0x200A

/*
* Midlevel queue return values.



2010-08-30 20:36:18

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush

Hannes Reinecke, on 08/30/2010 01:54 PM wrote:
>> As I've wrote multiple times, I'm pretty skeptical it will bring much.
>> Ordered tag mandates draining inside the device just like the original
>> barrier implementation. Sure, it's done at a lower layer and command
>> issue latencies will be reduced thanks to that but ordered-by-waiting
>> doesn't require _any_ draining at all. The whole pipeline can be kept
>> full all the time. I'm often wrong tho, so please feel free to go
>> ahead and prove me wrong. :-)
>>
> Actually, I thought about ordered tag writes, too.
> But eventually I had to give up on this for a simple reason:
> Ordered tag controls the ordering on the SCSI _TARGET_. But for a
> meaningful implementation we need to control the ordering all the way
> down from ->queuecommand(). Which means we have three areas we need
> to cover here:
> - driver (ie between ->queuecommand() and passing it off to the firmware)
> - firmware
> - fabric
>
> Sadly, the latter two are really hard to influence. And, what's more,
> with the new/modern CNAs with multiple queues and possible multiple
> routes to the target it becomes impossible to guarantee ordering.
> So using ordered tags for FibreChannel is not going to work, which
> makes implementing it a bit of a pointless exercise for me.

The situation is, actually, much better than you think. An SCSI
transport should provide an in-order delivery of commands. In some
transports it is required (e.g. iSCSI), in some - optional (e.g. FC).
For FC "an application client may determine if a device server supports
the precise delivery function by using the MODE SENSE and MODE SELECT
commands to examine and set the enable precise delivery checking (EPDC)
bit in the Fibre Channel Logical Unit Control page" (Fibre Channel
Protocol for SCSI (FCP)). You can find more details in FCP section
"Precise delivery of SCSI commands".

Regarding multiple queues, in case of a multipath access to a device
SCSI requires either each path be a separate I_T nexus, where order of
commands is maintained, or a transport required to maintain in-order
commands delivery among multiple paths in a single I_T nexus (session)
as it is done in iSCSI's MC/S and, most likely, wide SAS ports.

So, everything is in the specs. We only need to use it properly. How it
can be done on the drivers level as well as how errors recovery can be
done using ACA and UA_INTLCK facilities I wrote few weeks ago in the
"[RFC] relaxed barrier semantics" thread.

Vlad

2010-09-01 00:56:36

by Mike Snitzer

[permalink] [raw]
Subject: safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush]

Hi Kiyoshi,

On Mon, Aug 30 2010 at 2:13am -0400,
Kiyoshi Ueda <[email protected]> wrote:

> > That does seem like a valid concern. But I'm not seeing why it's unique
> > to SYNCHRONIZE CACHE. Any I/O that fails on the target side should be
> > passed up once the error gets to DM.
>
> See Tejun's explanation again:
> http://marc.info/?l=linux-kernel&m=128267361813859&w=2
> What I'm concerned about is whether the same thing Tejun explained
> for ATA can happen on other types of devices.
>
>
> A normal write command carries its data, so no data is lost on error.
> It can be retried cleanly, and if the retry succeeds, it really is a
> success, with no implicit data loss.
>
> A normal read command targets a specific sector. If the sector is broken,
> all retries will fail and the error will be reported upwards.
> So it can be retried cleanly as well.

I reached out to Fred Knight on this, to get more insight from a pure
SCSI SBC perspective, and he shared the following:

----- Forwarded message from "Knight, Frederick" <[email protected]> -----

> Date: Tue, 31 Aug 2010 13:24:15 -0400
> From: "Knight, Frederick" <[email protected]>
> To: Mike Snitzer <[email protected]>
> Subject: RE: safety of retrying SYNCHRONIZE CACHE?
>
> There are requirements in SBC to maintain data integrity. If you WRITE
> a block and READ that block, you must get the data you sent in the
> WRITE. This will be synchronized around the completion of the WRITE.
> Before the WRITE completes, who knows what a READ will return. Maybe
> all the old data, maybe all the new data, maybe some mix of old and new
> data. Once the WRITE ends successful, all READs of those LBAs (from any
> port) will always get the same data.
>
> As for errors, SBC describes how the deferred errors are reported (like
> when a CACHE tries to flush but fails). So if a write from cache to
> media does have problems, the device would tell you via a CHECK
> CONDITION (with the first byte of the sense data set to 71h or 73h). SBC
> clauses 4.12 and 4.13 cover a lot of this information. It is these error
> codes that prevent silent loss of data. And, in this case, when the
> CHECK CONDITION is delivered, it will have nothing to do with the
> command that was issued (the victim command). If you look into the
> sense data, you will see the deferred error flag, and all the additional
> information fields will relate to the original I/O.
>
> SYNCHRONIZE CACHE is not substantially different than a WRITE (it puts
> data on the media). So issuing it multiple times wouldn't be any
> different than issuing multiple WRITES (it might put a temporary dent in
> performance as everything flushes out to media). If it or any other
> commands fail with 71h/73h, then you have to dig down into the sense
> data buffer to find out what happened. For example, if you issue a
> WRITE command, and it completes into write back cache but later (before
> being written to the media), some of the cache breaks and loses data,
> then the device must signal a deferred error to tell the host, and cause
> a forced error on the LBA in question.
>
> Does that help?
>
> Fred
----- End forwarded message -----

Seems like verifying/improving the handling of CHECK CONDITION is a more
pressing concern than silent data loss purely due to SYNCHRONIZE CACHE
retries. Without proper handling we could completely miss these
deferred errors.

But how to effectively report such errors to upper layers is unclear to
me, given that a particular SCSI command can carry error information for
I/O that was already acknowledged as successful (e.g. to the FS).

drivers/scsi/scsi_error.c's various calls to scsi_check_sense()
illustrate Linux's current CHECK CONDITION handling. I need to look
more closely at how deferred errors propagate to upper layers. After an
initial look it seems scsi_error.c does handle retrying commands where
appropriate.
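
For reference, the midlayer can already distinguish a deferred error
(sense response code 71h/73h) from a current one using the existing
sense helpers; a minimal sketch (wrapper name invented) would be:

#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_eh.h>

/* Sketch: returns non-zero if the command carries a deferred error,
 * i.e. a failure of an earlier cached write reported on this command. */
static int sketch_has_deferred_error(struct scsi_cmnd *cmd)
{
        struct scsi_sense_hdr sshdr;

        if (!scsi_normalize_sense(cmd->sense_buffer,
                                  SCSI_SENSE_BUFFERSIZE, &sshdr))
                return 0;       /* no valid sense data */

        return scsi_sense_is_deferred(&sshdr);
}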

I believe Hannes has concerns/insight here.

Mike