2010-08-16 16:55:39

by Tejun Heo

Subject: [RFC PATCHSET block#for-2.6.36-post] block: convert to REQ_FLUSH/FUA

Hello,

This patchset contains five patches to convert block drivers
implementing REQ_HARDBARRIER to support REQ_FLUSH/FUA.

0001-block-loop-implement-REQ_FLUSH-FUA-support.patch
0002-virtio_blk-implement-REQ_FLUSH-FUA-support.patch
0003-lguest-replace-VIRTIO_F_BARRIER-support-with-VIRTIO_.patch
0004-md-implement-REQ_FLUSH-FUA-support.patch
0005-dm-implement-REQ_FLUSH-FUA-support.patch

I'm fairly confident about conversions 0001-0003. 0004 should be okay
although multipath wasn't tested. I'm less sure about 0005: it works
fine for the tests I've done, but there are many other targets and
code paths that I didn't test, so please be careful with the last
patch. I think it would be best to route the last two through the
respective md/dm trees after the core part is merged and pulled into
those trees.

The nice thing about the conversion is that in many cases it replaces
the postflush with FUA writes, which can be handled by request queues
lower in the chain. For md/dm, this replaces an array-wide cache flush
with FUA writes to only the affected member devices.
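
For reference, this is roughly the model a bio-based driver ends up
with after the conversion. The sketch below is purely illustrative -
the mydrv name and the mydrv_flush_cache()/mydrv_do_write() helpers
are made up and appear in none of the patches; the point is only that
the driver sees REQ_FLUSH/REQ_FUA directly on the bio and no longer
needs any draining or ordering of its own (read path and error
handling omitted; request-based drivers instead declare their
capabilities with blk_queue_flush(), as the virtio_blk patch does).

static int mydrv_make_request(struct request_queue *q, struct bio *bio)
{
	struct mydrv *dev = q->queuedata;	/* hypothetical driver data */

	/* REQ_FLUSH: write out the device's volatile cache first */
	if (bio->bi_rw & REQ_FLUSH)
		mydrv_flush_cache(dev);

	/* data part; honor REQ_FUA by making this particular write durable */
	if (bio_has_data(bio))
		mydrv_do_write(dev, bio, bio->bi_rw & REQ_FUA);

	bio_endio(bio, 0);
	return 0;
}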

After this patchset, the following remain to be converted.

* blktrace

* scsi_error.c tests REQ_HARDBARRIER for some reason. I think this
part can be dropped altogether, but I'm not sure.

* drbd and xen. I have no idea.

These patches are on top of

block#for-2.6.36-post (c047ab2dddeeafbd6f7c00e45a13a5c4da53ea0b)
+ block-replace-barrier-with-sequenced-flush patchset[1]
+ block-fix-incorrect-bio-request-flag-conversion-in-md patch[2]

and available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua

and contain the following changes.

Documentation/lguest/lguest.c | 36 +++-----
drivers/block/loop.c | 18 ++--
drivers/block/virtio_blk.c | 26 ++---
drivers/md/dm-crypt.c | 2
drivers/md/dm-io.c | 20 ----
drivers/md/dm-log.c | 2
drivers/md/dm-raid1.c | 8 -
drivers/md/dm-region-hash.c | 16 +--
drivers/md/dm-snap-persistent.c | 2
drivers/md/dm-snap.c | 6 -
drivers/md/dm-stripe.c | 2
drivers/md/dm.c | 180 ++++++++++++++++++++--------------------
drivers/md/linear.c | 4
drivers/md/md.c | 117 +++++---------------------
drivers/md/md.h | 23 +----
drivers/md/multipath.c | 4
drivers/md/raid0.c | 4
drivers/md/raid1.c | 175 +++++++++++++-------------------------
drivers/md/raid1.h | 2
drivers/md/raid10.c | 7 -
drivers/md/raid5.c | 38 ++++----
drivers/md/raid5.h | 1
include/linux/virtio_blk.h | 6 +
23 files changed, 278 insertions(+), 421 deletions(-)

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1022363
[2] http://thread.gmane.org/gmane.linux.kernel/1023435


2010-08-16 16:55:40

by Tejun Heo

Subject: [PATCH 4/5] md: implement REQ_FLUSH/FUA support

From: Tejun Heo <[email protected]>

This patch converts md to support REQ_FLUSH/FUA instead of the now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with the
processing of other requests, so there is no reason to mark the
queue congested while a FLUSH/FUA request is in progress.

* REQ_FLUSH/FUA failures are final and their users don't need retry
logic, so the retry logic is removed.

* A preflush needs to be issued to all member devices, but FUA writes
can be handled the same way as other writes - their processing can be
deferred to the request_queues of the member devices.
md_barrier_request() is renamed to md_flush_request() and simplified
accordingly.
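
In other words, the flush path collapses to roughly the following
(a simplified sketch; the real md_flush_request() and
md_submit_flush_data() in the diff below also handle the rdev
refcounting, flush_pending accounting and the workqueue handoff):

/* 1. issue an empty preflush to every active member device */
bi = bio_alloc(GFP_KERNEL, 0);		/* zero-length bio */
bi->bi_end_io = md_end_flush;
bi->bi_private = rdev;
bi->bi_bdev = rdev->bdev;
submit_bio(WRITE_FLUSH, bi);

/* 2. once all preflushes have completed, strip REQ_FLUSH and let the
 * personality handle the data part as a normal (possibly FUA) write */
bio->bi_rw &= ~REQ_FLUSH;
if (mddev->pers->make_request(mddev, bio))
	generic_make_request(bio);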

For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bios can simply be deferred to the
request_queues of the member devices. Barrier-related logic is removed.

* raid5: Queue draining logic is dropped. The FUA bit is propagated
through biodrain and stripe reconstruction so that all the updated
parts of the stripe are written out with FUA writes if any of the
dirtying writes was FUA (condensed in the sketch after this list).
I think the preread_active wait logic can be dropped as well but
wasn't sure. If it can be dropped, please go ahead and drop it.

* raid10: FUA bit needs to be propagated to write clones.
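
For raid5, the FUA propagation condenses to the following (gathered
almost verbatim from the raid5.c hunks below, just shown in one place):

/* ops_run_biodrain(): note that a FUA write touched this stripe member */
if (wbi->bi_rw & REQ_FUA)
	set_bit(R5_WantFUA, &dev->flags);

/* ops_complete_reconstruct(): if any member wants FUA, mark all updated ones */
for (i = disks; i--; )
	fua |= test_bit(R5_WantFUA, &sh->dev[i].flags);

/* ops_run_io(): issue the stripe writes with or without FUA accordingly */
if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
	rw = test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags) ? WRITE_FUA : WRITE;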

linear, raid0, 1, 5 and 10 tested.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Neil Brown <[email protected]>
---
drivers/md/linear.c | 4 +-
drivers/md/md.c | 117 +++++++-------------------------
drivers/md/md.h | 23 ++-----
drivers/md/multipath.c | 4 +-
drivers/md/raid0.c | 4 +-
drivers/md/raid1.c | 175 ++++++++++++++++--------------------------------
drivers/md/raid1.h | 2 -
drivers/md/raid10.c | 7 +-
drivers/md/raid5.c | 38 ++++++-----
drivers/md/raid5.h | 1 +
10 files changed, 123 insertions(+), 252 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index ba19060..8a2f767 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -294,8 +294,8 @@ static int linear_make_request (mddev_t *mddev, struct bio *bio)
dev_info_t *tmp_dev;
sector_t start_sector;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 700c96e..91b2929 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -226,12 +226,12 @@ static int md_make_request(struct request_queue *q, struct bio *bio)
return 0;
}
rcu_read_lock();
- if (mddev->suspended || mddev->barrier) {
+ if (mddev->suspended) {
DEFINE_WAIT(__wait);
for (;;) {
prepare_to_wait(&mddev->sb_wait, &__wait,
TASK_UNINTERRUPTIBLE);
- if (!mddev->suspended && !mddev->barrier)
+ if (!mddev->suspended)
break;
rcu_read_unlock();
schedule();
@@ -280,40 +280,29 @@ static void mddev_resume(mddev_t *mddev)

int mddev_congested(mddev_t *mddev, int bits)
{
- if (mddev->barrier)
- return 1;
return mddev->suspended;
}
EXPORT_SYMBOL(mddev_congested);

/*
- * Generic barrier handling for md
+ * Generic flush handling for md
*/

-#define POST_REQUEST_BARRIER ((void*)1)
-
-static void md_end_barrier(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err)
{
mdk_rdev_t *rdev = bio->bi_private;
mddev_t *mddev = rdev->mddev;
- if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
- set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);

rdev_dec_pending(rdev, mddev);

if (atomic_dec_and_test(&mddev->flush_pending)) {
- if (mddev->barrier == POST_REQUEST_BARRIER) {
- /* This was a post-request barrier */
- mddev->barrier = NULL;
- wake_up(&mddev->sb_wait);
- } else
- /* The pre-request barrier has finished */
- schedule_work(&mddev->barrier_work);
+ /* The pre-request flush has finished */
+ schedule_work(&mddev->flush_work);
}
bio_put(bio);
}

-static void submit_barriers(mddev_t *mddev)
+static void submit_flushes(mddev_t *mddev)
{
mdk_rdev_t *rdev;

@@ -330,60 +319,56 @@ static void submit_barriers(mddev_t *mddev)
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
bi = bio_alloc(GFP_KERNEL, 0);
- bi->bi_end_io = md_end_barrier;
+ bi->bi_end_io = md_end_flush;
bi->bi_private = rdev;
bi->bi_bdev = rdev->bdev;
atomic_inc(&mddev->flush_pending);
- submit_bio(WRITE_BARRIER, bi);
+ submit_bio(WRITE_FLUSH, bi);
rcu_read_lock();
rdev_dec_pending(rdev, mddev);
}
rcu_read_unlock();
}

-static void md_submit_barrier(struct work_struct *ws)
+static void md_submit_flush_data(struct work_struct *ws)
{
- mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
- struct bio *bio = mddev->barrier;
+ mddev_t *mddev = container_of(ws, mddev_t, flush_work);
+ struct bio *bio = mddev->flush_bio;

atomic_set(&mddev->flush_pending, 1);

- if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
- bio_endio(bio, -EOPNOTSUPP);
- else if (bio->bi_size == 0)
+ if (bio->bi_size == 0)
/* an empty barrier - all done */
bio_endio(bio, 0);
else {
- bio->bi_rw &= ~REQ_HARDBARRIER;
+ bio->bi_rw &= ~REQ_FLUSH;
if (mddev->pers->make_request(mddev, bio))
generic_make_request(bio);
- mddev->barrier = POST_REQUEST_BARRIER;
- submit_barriers(mddev);
}
if (atomic_dec_and_test(&mddev->flush_pending)) {
- mddev->barrier = NULL;
+ mddev->flush_bio = NULL;
wake_up(&mddev->sb_wait);
}
}

-void md_barrier_request(mddev_t *mddev, struct bio *bio)
+void md_flush_request(mddev_t *mddev, struct bio *bio)
{
spin_lock_irq(&mddev->write_lock);
wait_event_lock_irq(mddev->sb_wait,
- !mddev->barrier,
+ !mddev->flush_bio,
mddev->write_lock, /*nothing*/);
- mddev->barrier = bio;
+ mddev->flush_bio = bio;
spin_unlock_irq(&mddev->write_lock);

atomic_set(&mddev->flush_pending, 1);
- INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+ INIT_WORK(&mddev->flush_work, md_submit_flush_data);

- submit_barriers(mddev);
+ submit_flushes(mddev);

if (atomic_dec_and_test(&mddev->flush_pending))
- schedule_work(&mddev->barrier_work);
+ schedule_work(&mddev->flush_work);
}
-EXPORT_SYMBOL(md_barrier_request);
+EXPORT_SYMBOL(md_flush_request);

static inline mddev_t *mddev_get(mddev_t *mddev)
{
@@ -642,31 +627,6 @@ static void super_written(struct bio *bio, int error)
bio_put(bio);
}

-static void super_written_barrier(struct bio *bio, int error)
-{
- struct bio *bio2 = bio->bi_private;
- mdk_rdev_t *rdev = bio2->bi_private;
- mddev_t *mddev = rdev->mddev;
-
- if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
- error == -EOPNOTSUPP) {
- unsigned long flags;
- /* barriers don't appear to be supported :-( */
- set_bit(BarriersNotsupp, &rdev->flags);
- mddev->barriers_work = 0;
- spin_lock_irqsave(&mddev->write_lock, flags);
- bio2->bi_next = mddev->biolist;
- mddev->biolist = bio2;
- spin_unlock_irqrestore(&mddev->write_lock, flags);
- wake_up(&mddev->sb_wait);
- bio_put(bio);
- } else {
- bio_put(bio2);
- bio->bi_private = rdev;
- super_written(bio, error);
- }
-}
-
void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page)
{
@@ -675,51 +635,28 @@ void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
* and decrement it on completion, waking up sb_wait
* if zero is reached.
* If an error occurred, call md_error
- *
- * As we might need to resubmit the request if REQ_HARDBARRIER
- * causes ENOTSUPP, we allocate a spare bio...
*/
struct bio *bio = bio_alloc(GFP_NOIO, 1);
- int rw = REQ_WRITE | REQ_SYNC | REQ_UNPLUG;

bio->bi_bdev = rdev->bdev;
bio->bi_sector = sector;
bio_add_page(bio, page, size, 0);
bio->bi_private = rdev;
bio->bi_end_io = super_written;
- bio->bi_rw = rw;

atomic_inc(&mddev->pending_writes);
- if (!test_bit(BarriersNotsupp, &rdev->flags)) {
- struct bio *rbio;
- rw |= REQ_HARDBARRIER;
- rbio = bio_clone(bio, GFP_NOIO);
- rbio->bi_private = bio;
- rbio->bi_end_io = super_written_barrier;
- submit_bio(rw, rbio);
- } else
- submit_bio(rw, bio);
+ submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
+ bio);
}

void md_super_wait(mddev_t *mddev)
{
- /* wait for all superblock writes that were scheduled to complete.
- * if any had to be retried (due to BARRIER problems), retry them
- */
+ /* wait for all superblock writes that were scheduled to complete */
DEFINE_WAIT(wq);
for(;;) {
prepare_to_wait(&mddev->sb_wait, &wq, TASK_UNINTERRUPTIBLE);
if (atomic_read(&mddev->pending_writes)==0)
break;
- while (mddev->biolist) {
- struct bio *bio;
- spin_lock_irq(&mddev->write_lock);
- bio = mddev->biolist;
- mddev->biolist = bio->bi_next ;
- bio->bi_next = NULL;
- spin_unlock_irq(&mddev->write_lock);
- submit_bio(bio->bi_rw, bio);
- }
schedule();
}
finish_wait(&mddev->sb_wait, &wq);
@@ -1016,7 +953,6 @@ static int super_90_validate(mddev_t *mddev, mdk_rdev_t *rdev)
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 0;
@@ -1431,7 +1367,6 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 1;
@@ -4463,7 +4398,6 @@ static int md_run(mddev_t *mddev)
/* may be over-ridden by personality */
mddev->resync_max_sectors = mddev->dev_sectors;

- mddev->barriers_work = 1;
mddev->ok_start_degraded = start_dirty_degraded;

if (start_readonly && mddev->ro == 0)
@@ -4638,7 +4572,6 @@ static void md_clean(mddev_t *mddev)
mddev->recovery = 0;
mddev->in_sync = 0;
mddev->degraded = 0;
- mddev->barriers_work = 0;
mddev->safemode = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index fc56e0f..de4a365 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -67,7 +67,6 @@ struct mdk_rdev_s
#define Faulty 1 /* device is known to have a fault */
#define In_sync 2 /* device is in_sync with rest of array */
#define WriteMostly 4 /* Avoid reading if at all possible */
-#define BarriersNotsupp 5 /* REQ_HARDBARRIER is not supported */
#define AllReserved 6 /* If whole device is reserved for
* one array */
#define AutoDetected 7 /* added by auto-detect */
@@ -249,13 +248,6 @@ struct mddev_s
int degraded; /* whether md should consider
* adding a spare
*/
- int barriers_work; /* initialised to true, cleared as soon
- * as a barrier request to slave
- * fails. Only supported
- */
- struct bio *biolist; /* bios that need to be retried
- * because REQ_HARDBARRIER is not supported
- */

atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
@@ -308,16 +300,13 @@ struct mddev_s
struct list_head all_mddevs;

struct attribute_group *to_remove;
- /* Generic barrier handling.
- * If there is a pending barrier request, all other
- * writes are blocked while the devices are flushed.
- * The last to finish a flush schedules a worker to
- * submit the barrier request (without the barrier flag),
- * then submit more flush requests.
+ /* Generic flush handling.
+ * The last to finish preflush schedules a worker to submit
+ * the rest of the request (without the REQ_FLUSH flag).
*/
- struct bio *barrier;
+ struct bio *flush_bio;
atomic_t flush_pending;
- struct work_struct barrier_work;
+ struct work_struct flush_work;
};


@@ -458,7 +447,7 @@ extern void md_done_sync(mddev_t *mddev, int blocks, int ok);
extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);

extern int mddev_congested(mddev_t *mddev, int bits);
-extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
+extern void md_flush_request(mddev_t *mddev, struct bio *bio);
extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page);
extern void md_super_wait(mddev_t *mddev);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 0307d21..6d7ddf3 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -142,8 +142,8 @@ static int multipath_make_request(mddev_t *mddev, struct bio * bio)
struct multipath_bh * mp_bh;
struct multipath_info *multipath;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 6f7af46..a39f4c3 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -483,8 +483,8 @@ static int raid0_make_request(mddev_t *mddev, struct bio *bio)
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1d7b6a8..4b628b7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -319,83 +319,74 @@ static void raid1_end_write_request(struct bio *bio, int error)
if (r1_bio->bios[mirror] == bio)
break;

- if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
- set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
- set_bit(R1BIO_BarrierRetry, &r1_bio->state);
- r1_bio->mddev->barriers_work = 0;
- /* Don't rdev_dec_pending in this branch - keep it for the retry */
- } else {
+ /*
+ * 'one mirror IO has finished' event handler:
+ */
+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
+ if (!uptodate) {
+ md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
+ /* an I/O failed, we can't clear the bitmap */
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ } else
/*
- * this branch is our 'one mirror IO has finished' event handler:
+ * Set R1BIO_Uptodate in our master bio, so that we
+ * will return a good error code for to the higher
+ * levels even if IO on some other mirrored buffer
+ * fails.
+ *
+ * The 'master' represents the composite IO operation
+ * to user-side. So if something waits for IO, then it
+ * will wait for the 'master' bio.
*/
- r1_bio->bios[mirror] = NULL;
- to_put = bio;
- if (!uptodate) {
- md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R1BIO_Degraded, &r1_bio->state);
- } else
- /*
- * Set R1BIO_Uptodate in our master bio, so that
- * we will return a good error code for to the higher
- * levels even if IO on some other mirrored buffer fails.
- *
- * The 'master' represents the composite IO operation to
- * user-side. So if something waits for IO, then it will
- * wait for the 'master' bio.
- */
- set_bit(R1BIO_Uptodate, &r1_bio->state);
-
- update_head_pos(mirror, r1_bio);
-
- if (behind) {
- if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
- atomic_dec(&r1_bio->behind_remaining);
-
- /* In behind mode, we ACK the master bio once the I/O has safely
- * reached all non-writemostly disks. Setting the Returned bit
- * ensures that this gets done only once -- we don't ever want to
- * return -EIO here, instead we'll wait */
-
- if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
- test_bit(R1BIO_Uptodate, &r1_bio->state)) {
- /* Maybe we can return now */
- if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
- struct bio *mbio = r1_bio->master_bio;
- PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
- (unsigned long long) mbio->bi_sector,
- (unsigned long long) mbio->bi_sector +
- (mbio->bi_size >> 9) - 1);
- bio_endio(mbio, 0);
- }
+ set_bit(R1BIO_Uptodate, &r1_bio->state);
+
+ update_head_pos(mirror, r1_bio);
+
+ if (behind) {
+ if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
+ atomic_dec(&r1_bio->behind_remaining);
+
+ /*
+ * In behind mode, we ACK the master bio once the I/O
+ * has safely reached all non-writemostly
+ * disks. Setting the Returned bit ensures that this
+ * gets done only once -- we don't ever want to return
+ * -EIO here, instead we'll wait
+ */
+ if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
+ test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+ /* Maybe we can return now */
+ if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+ struct bio *mbio = r1_bio->master_bio;
+ PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
+ (unsigned long long) mbio->bi_sector,
+ (unsigned long long) mbio->bi_sector +
+ (mbio->bi_size >> 9) - 1);
+ bio_endio(mbio, 0);
}
}
- rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
}
+ rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+
/*
- *
* Let's see if all mirrored write operations have finished
* already.
*/
if (atomic_dec_and_test(&r1_bio->remaining)) {
- if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
- reschedule_retry(r1_bio);
- else {
- /* it really is the end of this request */
- if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
- /* free extra copy of the data pages */
- int i = bio->bi_vcnt;
- while (i--)
- safe_put_page(bio->bi_io_vec[i].bv_page);
- }
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
- r1_bio->sectors,
- !test_bit(R1BIO_Degraded, &r1_bio->state),
- behind);
- md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ /* free extra copy of the data pages */
+ int i = bio->bi_vcnt;
+ while (i--)
+ safe_put_page(bio->bi_io_vec[i].bv_page);
}
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ !test_bit(R1BIO_Degraded, &r1_bio->state),
+ behind);
+ md_write_end(r1_bio->mddev);
+ raid_end_bio_io(r1_bio);
}

if (to_put)
@@ -788,6 +779,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
struct page **behind_pages = NULL;
const int rw = bio_data_dir(bio);
const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned int do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
bool do_barriers;
mdk_rdev_t *blocked_rdev;

@@ -795,9 +787,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
* Register the new request and wait if the reconstruction
* thread has put up a bar for new requests.
* Continue immediately if no resync is active currently.
- * We test barriers_work *after* md_write_start as md_write_start
- * may cause the first superblock write, and that will check out
- * if barriers work.
*/

md_write_start(mddev, bio); /* wait on superblock update early */
@@ -821,13 +810,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
}
finish_wait(&conf->wait_barrier, &w);
}
- if (unlikely(!mddev->barriers_work &&
- (bio->bi_rw & REQ_HARDBARRIER))) {
- if (rw == WRITE)
- md_write_end(mddev);
- bio_endio(bio, -EOPNOTSUPP);
- return 0;
- }

wait_barrier(conf);

@@ -959,10 +941,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
atomic_set(&r1_bio->remaining, 0);
atomic_set(&r1_bio->behind_remaining, 0);

- do_barriers = bio->bi_rw & REQ_HARDBARRIER;
- if (do_barriers)
- set_bit(R1BIO_Barrier, &r1_bio->state);
-
bio_list_init(&bl);
for (i = 0; i < disks; i++) {
struct bio *mbio;
@@ -975,7 +953,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
mbio->bi_end_io = raid1_end_write_request;
- mbio->bi_rw = WRITE | do_barriers | do_sync;
+ mbio->bi_rw = WRITE | do_flush_fua | do_sync;
mbio->bi_private = r1_bio;

if (behind_pages) {
@@ -1631,41 +1609,6 @@ static void raid1d(mddev_t *mddev)
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
sync_request_write(mddev, r1_bio);
unplug = 1;
- } else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
- /* some requests in the r1bio were REQ_HARDBARRIER
- * requests which failed with -EOPNOTSUPP. Hohumm..
- * Better resubmit without the barrier.
- * We know which devices to resubmit for, because
- * all others have had their bios[] entry cleared.
- * We already have a nr_pending reference on these rdevs.
- */
- int i;
- const bool do_sync = (r1_bio->master_bio->bi_rw & REQ_SYNC);
- clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
- clear_bit(R1BIO_Barrier, &r1_bio->state);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i])
- atomic_inc(&r1_bio->remaining);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i]) {
- struct bio_vec *bvec;
- int j;
-
- bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
- /* copy pages from the failed bio, as
- * this might be a write-behind device */
- __bio_for_each_segment(bvec, bio, j, 0)
- bvec->bv_page = bio_iovec_idx(r1_bio->bios[i], j)->bv_page;
- bio_put(r1_bio->bios[i]);
- bio->bi_sector = r1_bio->sector +
- conf->mirrors[i].rdev->data_offset;
- bio->bi_bdev = conf->mirrors[i].rdev->bdev;
- bio->bi_end_io = raid1_end_write_request;
- bio->bi_rw = WRITE | do_sync;
- bio->bi_private = r1_bio;
- r1_bio->bios[i] = bio;
- generic_make_request(bio);
- }
} else {
int disk;

diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 5f2d443..adf8cfd 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -117,8 +117,6 @@ struct r1bio_s {
#define R1BIO_IsSync 1
#define R1BIO_Degraded 2
#define R1BIO_BehindIO 3
-#define R1BIO_Barrier 4
-#define R1BIO_BarrierRetry 5
/* For write-behind requests, we call bi_end_io when
* the last non-write-behind device completes, providing
* any write was successful. Otherwise we call when
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5551ccf..cc00c10 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -800,12 +800,13 @@ static int make_request(mddev_t *mddev, struct bio * bio)
int chunk_sects = conf->chunk_mask + 1;
const int rw = bio_data_dir(bio);
const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned int do_fua = (bio->bi_rw & REQ_FUA);
struct bio_list bl;
unsigned long flags;
mdk_rdev_t *blocked_rdev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

@@ -947,7 +948,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
conf->mirrors[d].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync;
+ mbio->bi_rw = WRITE | do_sync | do_fua;
mbio->bi_private = r10_bio;

atomic_inc(&r10_bio->remaining);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 20ac2f1..7eabd99 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -507,9 +507,12 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
int rw;
struct bio *bi;
mdk_rdev_t *rdev;
- if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
- rw = WRITE;
- else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+ if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
+ if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags))
+ rw = WRITE_FUA;
+ else
+ rw = WRITE;
+ } else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
rw = READ;
else
continue;
@@ -1032,6 +1035,8 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)

while (wbi && wbi->bi_sector <
dev->sector + STRIPE_SECTORS) {
+ if (wbi->bi_rw & REQ_FUA)
+ set_bit(R5_WantFUA, &dev->flags);
tx = async_copy_data(1, wbi, dev->page,
dev->sector, tx);
wbi = r5_next_bio(wbi, dev->sector);
@@ -1049,15 +1054,22 @@ static void ops_complete_reconstruct(void *stripe_head_ref)
int pd_idx = sh->pd_idx;
int qd_idx = sh->qd_idx;
int i;
+ bool fua = false;

pr_debug("%s: stripe %llu\n", __func__,
(unsigned long long)sh->sector);

+ for (i = disks; i--; )
+ fua |= test_bit(R5_WantFUA, &sh->dev[i].flags);
+
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];

- if (dev->written || i == pd_idx || i == qd_idx)
+ if (dev->written || i == pd_idx || i == qd_idx) {
set_bit(R5_UPTODATE, &dev->flags);
+ if (fua)
+ set_bit(R5_WantFUA, &dev->flags);
+ }
}

if (sh->reconstruct_state == reconstruct_state_drain_run)
@@ -3278,7 +3290,7 @@ static void handle_stripe5(struct stripe_head *sh)

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3580,7 +3592,7 @@ static void handle_stripe6(struct stripe_head *sh)

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3958,14 +3970,8 @@ static int make_request(mddev_t *mddev, struct bio * bi)
const int rw = bio_data_dir(bi);
int remaining;

- if (unlikely(bi->bi_rw & REQ_HARDBARRIER)) {
- /* Drain all pending writes. We only really need
- * to ensure they have been submitted, but this is
- * easier.
- */
- mddev->pers->quiesce(mddev, 1);
- mddev->pers->quiesce(mddev, 0);
- md_barrier_request(mddev, bi);
+ if (unlikely(bi->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bi);
return 0;
}

@@ -4083,7 +4089,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
finish_wait(&conf->wait_for_overlap, &w);
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
- if (mddev->barrier &&
+ if (mddev->flush_bio &&
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh);
@@ -4106,7 +4112,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
bio_endio(bi, 0);
}

- if (mddev->barrier) {
+ if (mddev->flush_bio) {
/* We need to wait for the stripes to all be handled.
* So: wait for preread_active_stripes to drop to 0.
*/
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 0f86f5e..ff9cad2 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -275,6 +275,7 @@ struct r6_state {
* filling
*/
#define R5_Wantdrain 13 /* dev->towrite needs to be drained */
+#define R5_WantFUA 14 /* Write should be FUA */
/*
* Write method
*/
--
1.7.1

2010-08-16 16:55:37

by Tejun Heo

Subject: [PATCH 1/5] block/loop: implement REQ_FLUSH/FUA support

From: Tejun Heo <[email protected]>

Deprecate REQ_HARDBARRIER and implement REQ_FLUSH/FUA instead. Also,
instead of checking file->f_op->fsync() directly, look at the return
value of vfs_fsync() and ignore -EINVAL.
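
The resulting write path boils down to the following (condensed from
do_bio_filebacked() in the hunks below; the mapping of real failures
to -EIO and the early -EOPNOTSUPP rejection of stale REQ_HARDBARRIER
bios are left out here):

if (bio->bi_rw & REQ_FLUSH) {
	ret = vfs_fsync(file, 0);	/* preflush: commit earlier writes */
	if (ret && ret != -EINVAL)	/* -EINVAL: no fsync method, ignore */
		goto out;
}

ret = lo_send(lo, bio, pos);		/* the data itself */

if ((bio->bi_rw & REQ_FUA) && !ret) {
	ret = vfs_fsync(file, 0);	/* FUA: make this write durable too */
	if (ret == -EINVAL)
		ret = 0;
}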

Signed-off-by: Tejun Heo <[email protected]>
---
drivers/block/loop.c | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 953d1e1..5d27bc6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -477,17 +477,17 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;

if (bio_rw(bio) == WRITE) {
- bool barrier = (bio->bi_rw & REQ_HARDBARRIER);
struct file *file = lo->lo_backing_file;

- if (barrier) {
- if (unlikely(!file->f_op->fsync)) {
- ret = -EOPNOTSUPP;
- goto out;
- }
+ /* REQ_HARDBARRIER is deprecated */
+ if (bio->bi_rw & REQ_HARDBARRIER) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }

+ if (bio->bi_rw & REQ_FLUSH) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret)) {
+ if (unlikely(ret && ret != -EINVAL)) {
ret = -EIO;
goto out;
}
@@ -495,9 +495,9 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)

ret = lo_send(lo, bio, pos);

- if (barrier && !ret) {
+ if ((bio->bi_rw & REQ_FUA) && !ret) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret))
+ if (unlikely(ret && ret != -EINVAL))
ret = -EIO;
}
} else
--
1.7.1

2010-08-16 16:55:38

by Tejun Heo

Subject: [PATCH 2/5] virtio_blk: implement REQ_FLUSH/FUA support

From: Tejun Heo <[email protected]>

Remove the now-unused REQ_HARDBARRIER support and implement
REQ_FLUSH/FUA support instead. A new feature flag, VIRTIO_BLK_F_FUA,
is added to indicate support for FUA.
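
In outline, the guest maps the two feature bits onto the queue's flush
capabilities at probe time and tags FUA writes per request (this just
condenses the hunks below):

/* probe time: tell the block layer what the host supports */
unsigned int flush = 0;

if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
	flush |= REQ_FLUSH;
if (virtio_has_feature(vdev, VIRTIO_BLK_F_FUA))
	flush |= REQ_FUA;
blk_queue_flush(q, flush);

/* per write request: mark FUA writes with the new command bit */
if (req->cmd_flags & REQ_FUA)
	vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;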

Signed-off-by: Tejun Heo <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
---
drivers/block/virtio_blk.c | 26 ++++++++++++--------------
include/linux/virtio_blk.h | 6 +++++-
2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index d10b635..ed0fb7d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -128,9 +128,6 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
}
}

- if (vbr->req->cmd_flags & REQ_HARDBARRIER)
- vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-
sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));

/*
@@ -157,6 +154,8 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
if (rq_data_dir(vbr->req) == WRITE) {
vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
out += num;
+ if (req->cmd_flags & REQ_FUA)
+ vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
} else {
vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
in += num;
@@ -307,6 +306,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
{
struct virtio_blk *vblk;
struct request_queue *q;
+ unsigned int flush;
int err;
u64 cap;
u32 v, blk_size, sg_elems, opt_io_size;
@@ -388,15 +388,13 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that to
- * implement write barrier support; otherwise, we must assume
- * that the host does not perform any kind of volatile write
- * caching.
- */
+ /* configure queue flush support */
+ flush = 0;
if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
- blk_queue_flush(q, REQ_FLUSH);
+ flush |= REQ_FLUSH;
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FUA))
+ flush |= REQ_FUA;
+ blk_queue_flush(q, flush);

/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
@@ -515,9 +513,9 @@ static const struct virtio_device_id id_table[] = {
};

static unsigned int features[] = {
- VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
- VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
- VIRTIO_BLK_F_SCSI, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
+ VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+ VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
+ VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_FUA,
};

/*
diff --git a/include/linux/virtio_blk.h b/include/linux/virtio_blk.h
index 167720d..f453f3c 100644
--- a/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -16,6 +16,7 @@
#define VIRTIO_BLK_F_SCSI 7 /* Supports scsi command passthru */
#define VIRTIO_BLK_F_FLUSH 9 /* Cache flush command support */
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
+#define VIRTIO_BLK_F_FUA 11 /* Forced Unit Access write support */

#define VIRTIO_BLK_ID_BYTES 20 /* ID string length */

@@ -70,7 +71,10 @@ struct virtio_blk_config {
#define VIRTIO_BLK_T_FLUSH 4

/* Get device ID command */
-#define VIRTIO_BLK_T_GET_ID 8
+#define VIRTIO_BLK_T_GET_ID 8
+
+/* FUA command */
+#define VIRTIO_BLK_T_FUA 16

/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
--
1.7.1

2010-08-16 16:55:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/5] lguest: replace VIRTIO_F_BARRIER support with VIRTIO_F_FLUSH/FUA support

From: Tejun Heo <[email protected]>

VIRTIO_F_BARRIER is deprecated. Replace it with VIRTIO_F_FLUSH/FUA
support.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
---
Documentation/lguest/lguest.c | 36 ++++++++++++++++--------------------
1 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index e9ce3c5..3be47d4 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue *vq)
off = out->sector * 512;

/*
- * The block device implements "barriers", where the Guest indicates
- * that it wants all previous writes to occur before this write. We
- * don't have a way of asking our kernel to do a barrier, so we just
- * synchronize all the data in the file. Pretty poor, no?
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
- /*
* In general the virtio block driver is allowed to try SCSI commands.
* It'd be nice if we supported eject, for example, but we don't.
*/
@@ -1679,6 +1670,19 @@ static void blk_request(struct virtqueue *vq)
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, ret);
}
+
+ /* Honor FUA by syncing everything. */
+ if (ret >= 0 && (out->type & VIRTIO_BLK_T_FUA)) {
+ ret = fdatasync(vblk->fd);
+ verbose("FUA fdatasync: %i\n", ret);
+ }
+
+ wlen = sizeof(*in);
+ *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ } else if (out->type & VIRTIO_BLK_T_FLUSH) {
+ /* Flush */
+ ret = fdatasync(vblk->fd);
+ verbose("FLUSH fdatasync: %i\n", ret);
wlen = sizeof(*in);
*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
} else {
@@ -1702,15 +1706,6 @@ static void blk_request(struct virtqueue *vq)
}
}

- /*
- * OK, so we noted that it was pretty poor to use an fdatasync as a
- * barrier. But Christoph Hellwig points out that we need a sync
- * *afterwards* as well: "Barriers specify no reordering to the front
- * or the back." And Jens Axboe confirmed it, so here we are:
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
/* Finished that request. */
add_used(vq, head, wlen);
}
@@ -1735,8 +1730,9 @@ static void setup_block_file(const char *filename)
vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
vblk->len = lseek64(vblk->fd, 0, SEEK_END);

- /* We support barriers. */
- add_feature(dev, VIRTIO_BLK_F_BARRIER);
+ /* We support FLUSH and FUA. */
+ add_feature(dev, VIRTIO_BLK_F_FLUSH);
+ add_feature(dev, VIRTIO_BLK_F_FUA);

/* Tell Guest how many sectors this device has. */
conf.capacity = cpu_to_le64(vblk->len / 512);
--
1.7.1

2010-08-16 16:56:52

by Tejun Heo

Subject: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

From: Tejun Heo <[email protected]>

This patch converts dm to support REQ_FLUSH/FUA instead of the now
deprecated REQ_HARDBARRIER.

For common parts,

* Barrier retry logic dropped from dm-io.

* Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.

* It's now guaranteed that all FLUSH bios passed on to dm targets are
zero length. bio_empty_barrier() tests are replaced with REQ_FLUSH
tests.

* Dropped unlikely() around REQ_FLUSH tests. Flushes are not unlikely
enough to be marked with unlikely().

For bio-based dm,

* The preflush is handled as before, but the postflush is dropped and
replaced by passing REQ_FUA down to the member request_queues. This
replaces one array-wide cache flush with member-specific FUA writes
(see the sketch after this list).

* __split_and_process_bio() now calls __clone_and_map_flush() directly
for flushes and guarantees that all FLUSH bios going to targets are
zero length.

* -EOPNOTSUPP retry logic dropped.
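
Put together, the bio-based flush path ends up looking roughly like
this (a simplified sketch of the new process_flush() in the dm.c hunks
below; the DM_ENDIO_REQUEUE/deferred-list handling is omitted):

md->flush_error = 0;

/* preflush: a zero-length WRITE_FLUSH bio is cloned to every target */
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
bio_init(&md->flush_bio);
md->flush_bio.bi_bdev = md->bdev;
md->flush_bio.bi_rw = WRITE_FLUSH;
__split_and_process_bio(md, &md->flush_bio);
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

/* an empty flush, or a failed preflush: we're done */
if (!bio_has_data(bio) || md->flush_error) {
	bio_endio(bio, md->flush_error);
	return;
}

/* data + REQ_FUA: the postflush is gone, member devices see FUA instead */
bio->bi_rw &= ~REQ_FLUSH;
__split_and_process_bio(md, bio);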

For request-based dm,

* Nothing much changes. It just needs to handle FLUSH requests as
before. It would be beneficial to advertise FUA capability so that
the FUA flag can be propagated down to member request_queues instead
of being sequenced as WRITE + FLUSH at the top queue.

Lightly tested linear, stripe, raid1, snap and crypt targets. Please
proceed with caution as I'm not familiar with the code base.

Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
---
drivers/md/dm-crypt.c | 2 +-
drivers/md/dm-io.c | 20 +----
drivers/md/dm-log.c | 2 +-
drivers/md/dm-raid1.c | 8 +-
drivers/md/dm-region-hash.c | 16 ++--
drivers/md/dm-snap-persistent.c | 2 +-
drivers/md/dm-snap.c | 6 +-
drivers/md/dm-stripe.c | 2 +-
drivers/md/dm.c | 180 +++++++++++++++++++-------------------
9 files changed, 113 insertions(+), 125 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 3bdbb61..da11652 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio,
struct dm_crypt_io *io;
struct crypt_config *cc;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
cc = ti->private;
bio->bi_bdev = cc->dev->bdev;
return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 0590c75..136d4f7 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -31,7 +31,6 @@ struct dm_io_client {
*/
struct io {
unsigned long error_bits;
- unsigned long eopnotsupp_bits;
atomic_t count;
struct task_struct *sleeper;
struct dm_io_client *client;
@@ -130,11 +129,8 @@ static void retrieve_io_and_region_from_bio(struct bio *bio, struct io **io,
*---------------------------------------------------------------*/
static void dec_count(struct io *io, unsigned int region, int error)
{
- if (error) {
+ if (error)
set_bit(region, &io->error_bits);
- if (error == -EOPNOTSUPP)
- set_bit(region, &io->eopnotsupp_bits);
- }

if (atomic_dec_and_test(&io->count)) {
if (io->sleeper)
@@ -310,8 +306,8 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
sector_t remaining = where->count;

/*
- * where->count may be zero if rw holds a write barrier and we
- * need to send a zero-sized barrier.
+ * where->count may be zero if rw holds a flush and we need to
+ * send a zero-sized flush.
*/
do {
/*
@@ -364,7 +360,7 @@ static void dispatch_io(int rw, unsigned int num_regions,
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}

@@ -393,9 +389,7 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions,
return -EIO;
}

-retry:
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = current;
io->client = client;
@@ -412,11 +406,6 @@ retry:
}
set_current_state(TASK_RUNNING);

- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
- goto retry;
- }
-
if (error_bits)
*error_bits = io->error_bits;

@@ -437,7 +426,6 @@ static int async_io(struct dm_io_client *client, unsigned int num_regions,

io = mempool_alloc(client->pool, GFP_NOIO);
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = NULL;
io->client = client;
diff --git a/drivers/md/dm-log.c b/drivers/md/dm-log.c
index 5a08be0..33420e6 100644
--- a/drivers/md/dm-log.c
+++ b/drivers/md/dm-log.c
@@ -300,7 +300,7 @@ static int flush_header(struct log_c *lc)
.count = 0,
};

- lc->io_req.bi_rw = WRITE_BARRIER;
+ lc->io_req.bi_rw = WRITE_FLUSH;

return dm_io(&lc->io_req, 1, &null_location, NULL);
}
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index 7413626..af33245 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@ static int mirror_flush(struct dm_target *ti)
struct dm_io_region io[ms->nr_mirrors];
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE_BARRIER,
+ .bi_rw = WRITE_FLUSH,
.mem.type = DM_IO_KMEM,
.mem.ptr.bvec = NULL,
.client = ms->io_client,
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *ms, struct bio *bio)
struct dm_io_region io[ms->nr_mirrors], *dest = io;
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+ .bi_rw = WRITE | (bio->bi_rw & WRITE_FLUSH_FUA),
.mem.type = DM_IO_BVEC,
.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes)
bio_list_init(&requeue);

while ((bio = bio_list_pop(writes))) {
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
bio_list_add(&sync, bio);
continue;
}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (likely(!bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
index bd5c58b..dad011a 100644
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -81,9 +81,9 @@ struct dm_region_hash {
struct list_head failed_recovered_regions;

/*
- * If there was a barrier failure no regions can be marked clean.
+ * If there was a flush failure no regions can be marked clean.
*/
- int barrier_failure;
+ int flush_failure;

void *context;
sector_t target_begin;
@@ -217,7 +217,7 @@ struct dm_region_hash *dm_region_hash_create(
INIT_LIST_HEAD(&rh->quiesced_regions);
INIT_LIST_HEAD(&rh->recovered_regions);
INIT_LIST_HEAD(&rh->failed_recovered_regions);
- rh->barrier_failure = 0;
+ rh->flush_failure = 0;

rh->region_pool = mempool_create_kmalloc_pool(MIN_REGIONS,
sizeof(struct dm_region));
@@ -399,8 +399,8 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio)
region_t region = dm_rh_bio_to_region(rh, bio);
int recovering = 0;

- if (bio_empty_barrier(bio)) {
- rh->barrier_failure = 1;
+ if (bio->bi_rw & REQ_FLUSH) {
+ rh->flush_failure = 1;
return;
}

@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
struct bio *bio;

for (bio = bios->head; bio; bio = bio->bi_next) {
- if (bio_empty_barrier(bio))
+ if (bio->bi_rw & REQ_FLUSH)
continue;
rh_inc(rh, dm_rh_bio_to_region(rh, bio));
}
@@ -555,9 +555,9 @@ void dm_rh_dec(struct dm_region_hash *rh, region_t region)
*/

/* do nothing for DM_RH_NOSYNC */
- if (unlikely(rh->barrier_failure)) {
+ if (unlikely(rh->flush_failure)) {
/*
- * If a write barrier failed some time ago, we
+ * If a write flush failed some time ago, we
* don't know whether or not this write made it
* to the disk, so we must resync the device.
*/
diff --git a/drivers/md/dm-snap-persistent.c b/drivers/md/dm-snap-persistent.c
index c097d8a..b9048f0 100644
--- a/drivers/md/dm-snap-persistent.c
+++ b/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@ static void persistent_commit_exception(struct dm_exception_store *store,
/*
* Commit exceptions to disk.
*/
- if (ps->valid && area_io(ps, WRITE_BARRIER))
+ if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
ps->valid = 0;

/*
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 5485377..62a6586 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1581,7 +1581,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
chunk_t chunk;
struct dm_snap_pending_exception *pe = NULL;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
bio->bi_bdev = s->cow->bdev;
return DM_MAPIO_REMAPPED;
}
@@ -1685,7 +1685,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio,
int r = DM_MAPIO_REMAPPED;
chunk_t chunk;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
if (!map_context->flush_request)
bio->bi_bdev = s->origin->bdev;
else
@@ -2123,7 +2123,7 @@ static int origin_map(struct dm_target *ti, struct bio *bio,
struct dm_dev *dev = ti->private;
bio->bi_bdev = dev->bdev;

- if (unlikely(bio_empty_barrier(bio)))
+ if (bio->bi_rw & REQ_FLUSH)
return DM_MAPIO_REMAPPED;

/* Only tell snapshots if this is a write */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index d6e28d7..0f534ca 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -214,7 +214,7 @@ static int stripe_map(struct dm_target *ti, struct bio *bio,
sector_t offset, chunk;
uint32_t stripe;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
BUG_ON(map_context->flush_request >= sc->stripes);
bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev;
return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b71cc9e..13e5707 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -139,21 +139,21 @@ struct mapped_device {
spinlock_t deferred_lock;

/*
- * An error from the barrier request currently being processed.
+ * An error from the flush request currently being processed.
*/
- int barrier_error;
+ int flush_error;

/*
- * Protect barrier_error from concurrent endio processing
+ * Protect flush_error from concurrent endio processing
* in request-based dm.
*/
- spinlock_t barrier_error_lock;
+ spinlock_t flush_error_lock;

/*
- * Processing queue (flush/barriers)
+ * Processing queue (flush)
*/
struct workqueue_struct *wq;
- struct work_struct barrier_work;
+ struct work_struct flush_work;

/* A pointer to the currently processing pre/post flush request */
struct request *flush_request;
@@ -195,8 +195,8 @@ struct mapped_device {
/* sysfs handle */
struct kobject kobj;

- /* zero-length barrier that will be cloned and submitted to targets */
- struct bio barrier_bio;
+ /* zero-length flush that will be cloned and submitted to targets */
+ struct bio flush_bio;
};

/*
@@ -507,7 +507,7 @@ static void end_io_acct(struct dm_io *io)

/*
* After this is decremented the bio must not be touched if it is
- * a barrier.
+ * a flush.
*/
dm_disk(md)->part0.in_flight[rw] = pending =
atomic_dec_return(&md->pending[rw]);
@@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io, int error)
*/
spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+ if (!(io->bio->bi_rw & REQ_FLUSH))
bio_list_add_head(&md->deferred,
io->bio);
} else
@@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io, int error)
io_error = io->error;
bio = io->bio;

- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & REQ_FLUSH) {
/*
- * There can be just one barrier request so we use
+ * There can be just one flush request so we use
* a per-device variable for error reporting.
* Note that you can't touch the bio after end_io_acct
*/
- if (!md->barrier_error && io_error != -EOPNOTSUPP)
- md->barrier_error = io_error;
+ if (!md->flush_error)
+ md->flush_error = io_error;
end_io_acct(io);
free_io(md, io);
} else {
@@ -744,21 +744,18 @@ static void end_clone_bio(struct bio *clone, int error)
blk_update_request(tio->orig, 0, nr_bytes);
}

-static void store_barrier_error(struct mapped_device *md, int error)
+static void store_flush_error(struct mapped_device *md, int error)
{
unsigned long flags;

- spin_lock_irqsave(&md->barrier_error_lock, flags);
+ spin_lock_irqsave(&md->flush_error_lock, flags);
/*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
+ * Basically, the first error is taken, but requeue request
+ * supersedes any I/O error.
*/
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
+ if (!md->flush_error || error == DM_ENDIO_REQUEUE)
+ md->flush_error = error;
+ spin_unlock_irqrestore(&md->flush_error_lock, flags);
}

/*
@@ -799,12 +796,12 @@ static void dm_end_request(struct request *clone, int error)
{
int rw = rq_data_dir(clone);
int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
+ bool is_flush = clone->cmd_flags & REQ_FLUSH;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;

- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;

@@ -819,9 +816,9 @@ static void dm_end_request(struct request *clone, int error)

free_rq_clone(clone);

- if (unlikely(is_barrier)) {
+ if (is_flush) {
if (unlikely(error))
- store_barrier_error(md, error);
+ store_flush_error(md, error);
run_queue = 0;
} else
blk_end_request_all(rq, error);
@@ -851,9 +848,9 @@ void dm_requeue_unmapped_request(struct request *clone)
struct request_queue *q = rq->q;
unsigned long flags;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request.
+ * Flush clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
@@ -950,14 +947,14 @@ static void dm_complete_request(struct request *clone, int error)
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request. So can't use
+ * Flush clones share an original request. So can't use
* softirq_done with the original.
* Pass the clone to dm_done() directly in this special case.
* It is safe (even if clone->q->queue_lock is held here)
* because there is no I/O dispatching during the completion
- * of barrier clone.
+ * of flush clone.
*/
dm_done(clone, error, true);
return;
@@ -979,9 +976,9 @@ void dm_kill_unmapped_request(struct request *clone, int error)
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request.
+ * Flush clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
@@ -1098,7 +1095,7 @@ static void dm_bio_destructor(struct bio *bio)
}

/*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that just does part of a bvec.
*/
static struct bio *split_bvec(struct bio *bio, sector_t sector,
unsigned short idx, unsigned int offset,
@@ -1113,7 +1110,7 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,

clone->bi_sector = sector;
clone->bi_bdev = bio->bi_bdev;
- clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+ clone->bi_rw = bio->bi_rw;
clone->bi_vcnt = 1;
clone->bi_size = to_bytes(len);
clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1137,6 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_rw &= ~REQ_HARDBARRIER;
clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
@@ -1186,7 +1182,7 @@ static void __flush_target(struct clone_info *ci, struct dm_target *ti,
__map_bio(ti, clone, tio);
}

-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
{
unsigned target_nr = 0, flush_nr;
struct dm_target *ti;
@@ -1208,9 +1204,6 @@ static int __clone_and_map(struct clone_info *ci)
sector_t len = 0, max;
struct dm_target_io *tio;

- if (unlikely(bio_empty_barrier(bio)))
- return __clone_and_map_empty_barrier(ci);
-
ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti))
return -EIO;
@@ -1308,11 +1301,11 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)

ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_HARDBARRIER))
+ if (!(bio->bi_rw & REQ_FLUSH))
bio_io_error(bio);
else
- if (!md->barrier_error)
- md->barrier_error = -EIO;
+ if (!md->flush_error)
+ md->flush_error = -EIO;
return;
}

@@ -1325,14 +1318,22 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
ci.io->md = md;
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
- ci.sector_count = bio_sectors(bio);
- if (unlikely(bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH))
+ ci.sector_count = bio_sectors(bio);
+ else {
+ /* all FLUSH bio's reaching here should be empty */
+ WARN_ON_ONCE(bio_has_data(bio));
ci.sector_count = 1;
+ }
ci.idx = bio->bi_idx;

start_io_acct(ci.io);
- while (ci.sector_count && !error)
- error = __clone_and_map(&ci);
+ while (ci.sector_count && !error) {
+ if (!(bio->bi_rw & REQ_FLUSH))
+ error = __clone_and_map(&ci);
+ else
+ error = __clone_and_map_flush(&ci);
+ }

/* drop the extra reference count */
dec_pending(ci.io, error);
@@ -1417,11 +1418,11 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
part_stat_unlock();

/*
- * If we're suspended or the thread is processing barriers
+ * If we're suspended or the thread is processing flushes
* we have to queue this io for later.
*/
if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+ (bio->bi_rw & REQ_FLUSH)) {
up_read(&md->io_lock);

if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1520,7 +1521,7 @@ static int setup_clone(struct request *clone, struct request *rq,
if (dm_rq_is_flush_request(rq)) {
blk_rq_init(NULL, clone);
clone->cmd_type = REQ_TYPE_FS;
- clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
+ clone->cmd_flags |= (REQ_FLUSH | WRITE);
} else {
r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
dm_rq_bio_constructor, tio);
@@ -1664,11 +1665,11 @@ static void dm_request_fn(struct request_queue *q)
if (!rq)
goto plug_and_out;

- if (unlikely(dm_rq_is_flush_request(rq))) {
+ if (dm_rq_is_flush_request(rq)) {
BUG_ON(md->flush_request);
md->flush_request = rq;
blk_start_request(rq);
- queue_work(md->wq, &md->barrier_work);
+ queue_work(md->wq, &md->flush_work);
goto out;
}

@@ -1843,7 +1844,7 @@ out:
static const struct block_device_operations dm_blk_dops;

static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
+static void dm_rq_flush_work(struct work_struct *work);

/*
* Allocate and initialise a blank device with a given minor.
@@ -1873,7 +1874,7 @@ static struct mapped_device *alloc_dev(int minor)
init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
+ spin_lock_init(&md->flush_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -1918,7 +1919,7 @@ static struct mapped_device *alloc_dev(int minor)
atomic_set(&md->pending[1], 0);
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
- INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
+ INIT_WORK(&md->flush_work, dm_rq_flush_work);
init_waitqueue_head(&md->eventq);

md->disk->major = _major;
@@ -2233,36 +2234,35 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
return r;
}

-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
{
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
- bio_init(&md->barrier_bio);
- md->barrier_bio.bi_bdev = md->bdev;
- md->barrier_bio.bi_rw = WRITE_BARRIER;
- __split_and_process_bio(md, &md->barrier_bio);
+ md->flush_error = 0;

+ /* handle REQ_FLUSH */
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}

-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
- md->barrier_error = 0;
+ bio_init(&md->flush_bio);
+ md->flush_bio.bi_bdev = md->bdev;
+ md->flush_bio.bi_rw = WRITE_FLUSH;
+ __split_and_process_bio(md, &md->flush_bio);

- dm_flush(md);
+ dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

- if (!bio_empty_barrier(bio)) {
- __split_and_process_bio(md, bio);
- dm_flush(md);
+ /* if it's an empty flush or the preflush failed, we're done */
+ if (!bio_has_data(bio) || md->flush_error) {
+ if (md->flush_error != DM_ENDIO_REQUEUE)
+ bio_endio(bio, md->flush_error);
+ else {
+ spin_lock_irq(&md->deferred_lock);
+ bio_list_add_head(&md->deferred, bio);
+ spin_unlock_irq(&md->deferred_lock);
+ }
+ return;
}

- if (md->barrier_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->barrier_error);
- else {
- spin_lock_irq(&md->deferred_lock);
- bio_list_add_head(&md->deferred, bio);
- spin_unlock_irq(&md->deferred_lock);
- }
+ /* issue data + REQ_FUA */
+ bio->bi_rw &= ~REQ_FLUSH;
+ __split_and_process_bio(md, bio);
}

/*
@@ -2291,8 +2291,8 @@ static void dm_wq_work(struct work_struct *work)
if (dm_request_based(md))
generic_make_request(c);
else {
- if (c->bi_rw & REQ_HARDBARRIER)
- process_barrier(md, c);
+ if (c->bi_rw & REQ_FLUSH)
+ process_flush(md, c);
else
__split_and_process_bio(md, c);
}
@@ -2317,8 +2317,8 @@ static void dm_rq_set_flush_nr(struct request *clone, unsigned flush_nr)
tio->info.flush_request = flush_nr;
}

-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
+/* Issue flush requests to targets and wait for their completion. */
+static int dm_rq_flush(struct mapped_device *md)
{
int i, j;
struct dm_table *map = dm_get_live_table(md);
@@ -2326,7 +2326,7 @@ static int dm_rq_barrier(struct mapped_device *md)
struct dm_target *ti;
struct request *clone;

- md->barrier_error = 0;
+ md->flush_error = 0;

for (i = 0; i < num_targets; i++) {
ti = dm_table_get_target(map, i);
@@ -2341,26 +2341,26 @@ static int dm_rq_barrier(struct mapped_device *md)
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
dm_table_put(map);

- return md->barrier_error;
+ return md->flush_error;
}

-static void dm_rq_barrier_work(struct work_struct *work)
+static void dm_rq_flush_work(struct work_struct *work)
{
int error;
struct mapped_device *md = container_of(work, struct mapped_device,
- barrier_work);
+ flush_work);
struct request_queue *q = md->queue;
struct request *rq;
unsigned long flags;

/*
* Hold the md reference here and leave it at the last part so that
- * the md can't be deleted by device opener when the barrier request
+ * the md can't be deleted by device opener when the flush request
* completes.
*/
dm_get(md);

- error = dm_rq_barrier(md);
+ error = dm_rq_flush(md);

rq = md->flush_request;
md->flush_request = NULL;
@@ -2520,7 +2520,7 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
up_write(&md->io_lock);

/*
- * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
+ * Request-based dm uses md->wq for flush (dm_rq_flush_work) which
* can be kicked until md->queue is stopped. So stop md->queue before
* flushing md->wq.
*/
--
1.7.1

2010-08-16 18:33:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/5] virtio_blk: implement REQ_FLUSH/FUA support

On Mon, Aug 16, 2010 at 06:52:00PM +0200, Tejun Heo wrote:
> From: Tejun Heo <[email protected]>
>
> Remove now unused REQ_HARDBARRIER support and implement REQ_FLUSH/FUA
> support instead. A new feature flag VIRTIO_BLK_F_FUA is added to
> indicate the support for FUA.

I'm not sure it's worth it. The pure REQ_FLUSH path works now and is
well tested with kvm/qemu. We can still easily add a FUA bit, and
even a pre-flush bit if the protocol roundtrips matter in real life
benchmarking.

2010-08-16 19:04:15

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

On Mon, Aug 16 2010 at 12:52pm -0400,
Tejun Heo <[email protected]> wrote:

> From: Tejun Heo <[email protected]>
>
> This patch converts dm to support REQ_FLUSH/FUA instead of now
> deprecated REQ_HARDBARRIER.

What tree does this patch apply to? I know it doesn't apply to
v2.6.36-rc1, e.g.: http://git.kernel.org/linus/708e929513502fb0

> For bio-based dm,
...
> * -EOPNOTSUPP retry logic dropped.

That logic wasn't just about retries (at least not in the latest
kernel). With commit 708e929513502fb0 the -EOPNOTSUPP checking also
serves to optimize the barrier+discard case (when discards aren't
supported).

> For request-based dm,
>
> * Nothing much changes. It just needs to handle FLUSH requests as
> before. It would be beneficial to advertise FUA capability so that
> it can propagate FUA flags down to member request_queues instead of
> sequencing it as WRITE + FLUSH at the top queue.

Can you expand on that TODO a bit? What is the mechanism to propagate
FUA down to a DM device's members? I'm only aware of propagating member
devices' features up to the top-level DM device's request-queue (not the
opposite).

Are you saying that establishing the FUA capability on the top-level DM
device's request_queue is sufficient? If so then why not make the
change?

> Lightly tested linear, stripe, raid1, snap and crypt targets. Please
> proceed with caution as I'm not familiar with the code base.

This is concerning... if we're to offer more comprehensive review I
think we need more detail on what guided your changes rather than
details of what the resulting changes are.

Mike

2010-08-17 01:16:58

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 2/5] virtio_blk: implement REQ_FLUSH/FUA support

On Tue, 17 Aug 2010 02:22:00 am Tejun Heo wrote:
> From: Tejun Heo <[email protected]>
>
> Remove now unused REQ_HARDBARRIER support and implement REQ_FLUSH/FUA
> support instead. A new feature flag VIRTIO_BLK_F_FUA is added to
> indicate the support for FUA.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Michael S. Tsirkin <[email protected]>

Acked-by: Rusty Russell <[email protected]>

And also for the lguest-specific patch.

Thanks!
Rusty.

2010-08-17 08:20:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/5] virtio_blk: implement REQ_FLUSH/FUA support

Hello,

On 08/16/2010 08:33 PM, Christoph Hellwig wrote:
> On Mon, Aug 16, 2010 at 06:52:00PM +0200, Tejun Heo wrote:
>> From: Tejun Heo <[email protected]>
>>
>> Remove now unused REQ_HARDBARRIER support and implement REQ_FLUSH/FUA
>> support instead. A new feature flag VIRTIO_BLK_F_FUA is added to
>> indicate the support for FUA.
>
> I'm not sure it's worth it. The pure REQ_FLUSH path works now and is
> well tested with kvm/qemu. We can still easily add a FUA bit, and
> even a pre-flush bit if the protocol roundtrips matter in real life
> benchmarking.

Hmmm... the underlying storage could be md/dm RAIDs in which case FUA
should be cheaper than FLUSH.

Thanks.

--
tejun

2010-08-17 08:22:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/5] virtio_blk: implement REQ_FLUSH/FUA support

On 08/17/2010 03:16 AM, Rusty Russell wrote:
> On Tue, 17 Aug 2010 02:22:00 am Tejun Heo wrote:
>> From: Tejun Heo <[email protected]>
>>
>> Remove now unused REQ_HARDBARRIER support and implement REQ_FLUSH/FUA
>> support instead. A new feature flag VIRTIO_BLK_F_FUA is added to
>> indicate the support for FUA.
>>
>> Signed-off-by: Tejun Heo <[email protected]>
>> Cc: Michael S. Tsirkin <[email protected]>
>
> Acked-by: Rusty Russell <[email protected]>
>
> And also for the lguest-specific patch.

Thanks. Acked-by's added.

--
tejun

2010-08-17 09:39:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

Hello,

On 08/16/2010 09:02 PM, Mike Snitzer wrote:
> On Mon, Aug 16 2010 at 12:52pm -0400,
> Tejun Heo <[email protected]> wrote:
>
>> From: Tejun Heo <[email protected]>
>>
>> This patch converts dm to support REQ_FLUSH/FUA instead of now
>> deprecated REQ_HARDBARRIER.
>
> What tree does this patch apply to? I know it doesn't apply to
> v2.6.36-rc1, e.g.: http://git.kernel.org/linus/708e929513502fb0

(from the head message)
These patches are on top of

block#for-2.6.36-post (c047ab2dddeeafbd6f7c00e45a13a5c4da53ea0b)
+ block-replace-barrier-with-sequenced-flush patchset[1]
+ block-fix-incorrect-bio-request-flag-conversion-in-md patch[2]

and available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua

[1] http://thread.gmane.org/gmane.linux.kernel/1022363
[2] http://thread.gmane.org/gmane.linux.kernel/1023435

Probably fetching the git tree is the easiest way to review?

>> For bio-based dm,
>> * -EOPNOTSUPP retry logic dropped.
>
> That logic wasn't just about retries (at least not in the latest
> kernel). With commit 708e929513502fb0 the -EOPNOTSUPP checking also
> serves to optimize the barrier+discard case (when discards aren't
> supported).

With the patch applied, there's no second flush. Those requests would
now be REQ_FLUSH + REQ_DISCARD. The first can't be avoided anyway and
there won't be the second flush to begin with, so I don't think this
worsens anything.

>> * Nothing much changes. It just needs to handle FLUSH requests as
>> before. It would be beneficial to advertise FUA capability so that
>> it can propagate FUA flags down to member request_queues instead of
>> sequencing it as WRITE + FLUSH at the top queue.
>
> Can you expand on that TODO a bit? What is the mechanism to propagate
> FUA down to a DM device's members? I'm only aware of propagating member
> devices' features up to the top-level DM device's request-queue (not the
> opposite).
>
> Are you saying that establishing the FUA capability on the top-level DM
> device's request_queue is sufficient? If so then why not make the
> change?

Yeah, I think it would be enough to always advertise FLUSH|FUA if the
member devices support FLUSH (regardless of FUA support). The reason
why I didn't do it was, umm, laziness, I suppose.
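
For the record, the whole change should be on the order of the
following -- an illustrative sketch only, modulo only doing it when the
member devices advertise FLUSH at all; the exact placement in dm.c is
my assumption, not something taken from the patch:

	/*
	 * Advertise FLUSH and FUA on the dm queue.  FUA writes to member
	 * devices that lack FUA get sequenced as write + postflush by
	 * their own request_queues, so nothing else is needed here.
	 */
	blk_queue_flush(md->queue, REQ_FLUSH | REQ_FUA);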

>> Lightly tested linear, stripe, raid1, snap and crypt targets. Please
>> proceed with caution as I'm not familiar with the code base.
>
> This is concerning...

Yeap, I want you to be concerned. :-) This was the first time I looked
at the dm code and there are many different disjoint code paths and I
couldn't fully follow or test all of them, so it definitely needs a
careful review from someone who understands the whole thing.

> if we're to offer more comprehensive review I think we need more
> detail on what guided your changes rather than details of what the
> resulting changes are.

I'll try to explain it. If you have any further questions, please let
me know.

* For common part (dm-io, dm-log):

* Empty REQ_HARDBARRIER is converted to empty REQ_FLUSH.

* REQ_HARDBARRIER w/ data is converted either to data + REQ_FLUSH +
REQ_FUA or data + REQ_FUA. The former is the safe equivalent
conversion, but there could be cases where the latter is enough
(a rough sketch of this rule appears after this list).

* For bio based dm:

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't have any ordering
requirements. Remove assumptions of ordering and/or draining.

A related question: Is dm_wait_for_completion() used in
process_flush() safe against starvation under continuous influx of
other commands?

* As REQ_FLUSH/FUA doesn't require any ordering of requests before
or after it, on array devices, the latter part - REQ_FUA - can be
handled like other writes. ie. REQ_FLUSH needs to be broadcasted
to all devices but once that is complete the data/REQ_FUA bio can
be sent to only the affected devices. This needs some care as
there are bio cloning/splitting code paths where REQ_FUA bit isn't
preserved.

* Guarantee that REQ_FLUSH w/ data never reaches targets (this in
part is to put it in alignment with request based dm).

* For request based dm:

* The sequencing is done by the block layer for the top level
request_queue, so the only things request based dm needs to make
sure is 1. handling empty REQ_FLUSH correctly (block layer will
only send down empty REQ_FLUSHes) and 2. propagate REQ_FUA bit to
member devices.
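
To make the first set of rules above concrete, here's a rough sketch of
the flag conversion (not code from the patches; convert_barrier_rw() is
a made-up helper name, and whether the preflush can be dropped is a
per-call-site decision):

static unsigned long convert_barrier_rw(unsigned long rw, bool has_data,
					bool needs_preflush)
{
	if (!(rw & REQ_HARDBARRIER))
		return rw;

	rw &= ~REQ_HARDBARRIER;
	if (!has_data)
		return rw | REQ_FLUSH;	/* empty barrier -> empty flush */

	/* barrier w/ data: FUA write, optionally preceded by a preflush */
	rw |= REQ_FUA;
	if (needs_preflush)
		rw |= REQ_FLUSH;
	return rw;
}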

Thanks.

--
tejun

2010-08-17 13:13:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

On Tue, Aug 17, 2010 at 11:33:52AM +0200, Tejun Heo wrote:
> > That logic wasn't just about retries (at least not in the latest
> > kernel). With commit 708e929513502fb0 the -EOPNOTSUPP checking also
> > serves to optimize the barrier+discard case (when discards aren't
> > supported).
>
> With the patch applied, there's no second flush. Those requests would
> now be REQ_FLUSH + REQ_DISCARD. The first can't be avoided anyway and
> there won't be the second flush to begin with, so I don't think this
> worsens anything.

In fact the pre-flush is completely superfluous for discards, but that's a
separate discussion; it should be changed explicitly rather than as part of
this patchset.

2010-08-17 13:24:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/5] virtio_blk: implement REQ_FLUSH/FUA support

On Tue, Aug 17, 2010 at 10:17:15AM +0200, Tejun Heo wrote:
> >> Remove now unused REQ_HARDBARRIER support and implement REQ_FLUSH/FUA
> >> support instead. A new feature flag VIRTIO_BLK_F_FUA is added to
> >> indicate the support for FUA.
> >
> > I'm not sure it's worth it. The pure REQ_FLUSH path works now and is
> > well tested with kvm/qemu. We can still easily add a FUA bit, and
> > even a pre-flush bit if the protocol roundtrips matter in real life
> > benchmarking.
>
> Hmmm... the underlying storage could be md/dm RAIDs in which case FUA
> should be cheaper than FLUSH.

If someone ever wrote a virtio-blk backend that sits directly ontop
of the Linux block layer that would be true. Of the five known
virtio-blk backends all operate on normal files using the Posix I/O
APIs, or the Linux aio API (optionally in qemu) or in-kernel
vfs_read/vfs_write (vhost-blk).

Given how little testing lguest gets compared to qemu I really don't
want a protocol addition for it unless it really buys us something.
Once we're done with this barrier conversion I plan on benchmarking
FUA and a pre-flush tag on the command for virtio in real-life setups,
and seeing if it actually buys us anything.

2010-08-17 14:08:12

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

On Tue, Aug 17 2010 at 5:33am -0400,
Tejun Heo <[email protected]> wrote:

> Hello,
>
> On 08/16/2010 09:02 PM, Mike Snitzer wrote:
> > On Mon, Aug 16 2010 at 12:52pm -0400,
> > Tejun Heo <[email protected]> wrote:
> >
> >> From: Tejun Heo <[email protected]>
> >>
> >> This patch converts dm to support REQ_FLUSH/FUA instead of now
> >> deprecated REQ_HARDBARRIER.
> >
> > What tree does this patch apply to? I know it doesn't apply to
> > v2.6.36-rc1, e.g.: http://git.kernel.org/linus/708e929513502fb0
>
> (from the head message)
> These patches are on top of
>
> block#for-2.6.36-post (c047ab2dddeeafbd6f7c00e45a13a5c4da53ea0b)
> + block-replace-barrier-with-sequenced-flush patchset[1]
> + block-fix-incorrect-bio-request-flag-conversion-in-md patch[2]
>
> and available in the following git tree.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua
>
> [1] http://thread.gmane.org/gmane.linux.kernel/1022363
> [2] http://thread.gmane.org/gmane.linux.kernel/1023435
>
> Probably fetching the git tree is the easiest way to review?

OK, I missed this info because I just looked at the DM patch.

> >> For bio-based dm,
> >> * -EOPNOTSUPP retry logic dropped.
> >
> > That logic wasn't just about retries (at least not in the latest
> > kernel). With commit 708e929513502fb0 the -EOPNOTSUPP checking also
> > serves to optimize the barrier+discard case (when discards aren't
> > supported).
>
> With the patch applied, there's no second flush. Those requests would
> now be REQ_FLUSH + REQ_DISCARD. The first can't be avoided anyway and
> there won't be the second flush to begin with, so I don't think this
> worsens anything.

Makes sense, but your patches still need to be refreshed against the
latest (2.6.36-rc1) upstream code. Numerous changes went into DM
recently.

> >> * Nothing much changes. It just needs to handle FLUSH requests as
> >> before. It would be beneficial to advertise FUA capability so that
> >> it can propagate FUA flags down to member request_queues instead of
> >> sequencing it as WRITE + FLUSH at the top queue.
> >
> > Can you expand on that TODO a bit? What is the mechanism to propagate
> > FUA down to a DM device's members? I'm only aware of propagating member
> > devices' features up to the top-level DM device's request-queue (not the
> > opposite).
> >
> > Are you saying that establishing the FUA capability on the top-level DM
> > device's request_queue is sufficient? If so then why not make the
> > change?
>
> Yeah, I think it would be enough to always advertise FLUSH|FUA if the
> member devices support FLUSH (regardless of FUA support). The reason
> why I didn't do it was, umm, laziness, I suppose.

I don't buy it.. you're far from lazy! ;)

> >> Lightly tested linear, stripe, raid1, snap and crypt targets. Please
> >> proceed with caution as I'm not familiar with the code base.
> >
> > This is concerning...
>
> Yeap, I want you to be concerned. :-) This was the first time I looked
> at the dm code and there are many different disjoint code paths and I
> couldn't fully follow or test all of them, so it definitely needs a
> careful review from someone who understands the whole thing.

You'll need Mikulas (bio-based) and NEC (request-based, Kiyoshi and
Jun'ichi) to give it serious review.

NOTE: NEC has already given some preliminary feedback to hch in the
"[PATCH, RFC 2/2] dm: support REQ_FLUSH directly" thread:
https://www.redhat.com/archives/dm-devel/2010-August/msg00026.html
https://www.redhat.com/archives/dm-devel/2010-August/msg00033.html

> > if we're to offer more comprehensive review I think we need more
> > detail on what guided your changes rather than details of what the
> > resulting changes are.
>
> I'll try to explain it. If you have any further questions, please let
> me know.

Thanks for the additional details.

> * For bio based dm:
>
> * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't have any ordering
> requirements. Remove assumptions of ordering and/or draining.
>
> A related question: Is dm_wait_for_completion() used in
> process_flush() safe against starvation under continuous influx of
> other commands?

OK, so you folded dm_flush() directly into process_flush() -- the code
that was dm_flush() only needs to be called once now.

As for your specific dm_wait_for_completion() concern -- I'll defer to
Mikulas. But I'll add: we haven't had any reported starvation issues
with DM's existing barrier support. DM uses a mempool for its clones,
so it should naturally throttle (without starvation) when memory gets
low.

> * As REQ_FLUSH/FUA doesn't require any ordering of requests before
> or after it, on array devices, the latter part - REQ_FUA - can be
> handled like other writes. ie. REQ_FLUSH needs to be broadcasted
> to all devices but once that is complete the data/REQ_FUA bio can
> be sent to only the affected devices. This needs some care as
> there are bio cloning/splitting code paths where REQ_FUA bit isn't
> preserved.
>
> * Guarantee that REQ_FLUSH w/ data never reaches targets (this in
> part is to put it in alignment with request based dm).

bio-based DM already split the barrier out from the data (in
process_barrier). You've renamed process_barrier to process_flush and
added the REQ_FLUSH logic like I'd expect.

> * For request based dm:
>
> * The sequencing is done by the block layer for the top level
> request_queue, so the only things request based dm needs to make
> sure is 1. handling empty REQ_FLUSH correctly (block layer will
> only send down empty REQ_FLUSHes) and 2. propagate REQ_FUA bit to
> member devices.

OK, so seems 1 is done, 2 is still TODO. Looking at your tree it seems
2 would be as simple as using the following in
dm_init_request_based_queue (on the most current upstream dm.c):

blk_queue_flush(q, REQ_FLUSH | REQ_FUA);

(your current patch only sets REQ_FLUSH in alloc_dev).

2010-08-17 16:27:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/5] virtio_blk: implement REQ_FLUSH/FUA support

Hello,

On 08/17/2010 03:23 PM, Christoph Hellwig wrote:
>> Hmmm... the underlying storage could be md/dm RAIDs in which case FUA
>> should be cheaper than FLUSH.
>
> If someone ever wrote a virtio-blk backend that sits directly ontop
> of the Linux block layer that would be true. Of the five known
> virtio-blk backends all operate on normal files using the Posix I/O
> APIs, or the Linux aio API (optionally in qemu) or in-kernel
> vfs_read/vfs_write (vhost-blk).

Right.

> Given how little testing lguest gets compared to qemu I really don't
> want a protocol addition for it unless it really buys us something.
> Once we're done with this barrier conversion I plan on benchmarking
> FUA and a pre-flush tag on the command for virtio in real-life setups,
> and seeing if it actually buys us anything.

Hmmm... yeah, we can drop it. Michael, what do you think?

Thanks.

--
tejun

2010-08-17 16:55:15

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

Hello,

On 08/17/2010 04:07 PM, Mike Snitzer wrote:
>> With the patch applied, there's no second flush. Those requests would
>> now be REQ_FLUSH + REQ_DISCARD. The first can't be avoided anyway and
>> there won't be the second flush to begin with, so I don't think this
>> worsens anything.
>
> Makes sense, but your patches still need to be refreshed against the
> latest (2.6.36-rc1) upstream code. Numerous changes went into DM
> recently.

Sure thing. The block part isn't finalized yet, hence the RFC tag. Once
the block layer part is settled, it probably should be pulled into
dm/md and other trees and conversions should happen there.

>> Yeap, I want you to be concerned. :-) This was the first time I looked
>> at the dm code and there are many different disjoint code paths and I
>> couldn't fully follow or test all of them, so it definitely needs a
>> careful review from someone who understands the whole thing.
>
> You'll need Mikulas (bio-based) and NEC (request-based, Kiyoshi and
> Jun'ichi) to give it serious review.

Oh, you already cc'd them. Great. Hello, guys, the original thread
is

http://thread.gmane.org/gmane.linux.raid/29100

> NOTE: NEC has already given some preliminary feedback to hch in the
> "[PATCH, RFC 2/2] dm: support REQ_FLUSH directly" thread:
> https://www.redhat.com/archives/dm-devel/2010-August/msg00026.html
> https://www.redhat.com/archives/dm-devel/2010-August/msg00033.html

Hmmm... I think both issues don't exist in this incarnation of
conversion although I'm fairly sure there will be other issues. :-)

>> A related question: Is dm_wait_for_completion() used in
>> process_flush() safe against starvation under continuous influx of
>> other commands?
>
> As for your specific dm_wait_for_completion() concern -- I'll defer to
> Mikulas. But I'll add: we haven't had any reported starvation issues
> with DM's existing barrier support. DM uses a mempool for its clones,
> so it should naturally throttle (without starvation) when memory gets
> low.

I see, but a single pending flush and steady write streams w/o saturating
the mempool would be able to stall dm_wait_for_completion(), no? Eh
well, it's a separate issue, I guess.

>> * Guarantee that REQ_FLUSH w/ data never reaches targets (this in
>> part is to put it in alignment with request based dm).
>
> bio-based DM already split the barrier out from the data (in
> process_barrier). You've renamed process_barrier to process_flush and
> added the REQ_FLUSH logic like I'd expect.

Yeah, and I threw in a WARN_ON() there to make sure REQ_FLUSH + data bios
don't slip through for whatever reason.

>> * For request based dm:
>>
>> * The sequencing is done by the block layer for the top level
>> request_queue, so the only things request based dm needs to make
>> sure is 1. handling empty REQ_FLUSH correctly (block layer will
>> only send down empty REQ_FLUSHes) and 2. propagate REQ_FUA bit to
>> member devices.
>
> OK, so seems 1 is done, 2 is still TODO. Looking at your tree it seems
> 2 would be as simple as using the following in

Oh, I was talking about the other way around. Passing REQ_FUA in
bio->bi_rw down to member request_queues. Sometimes while
constructing clone / split bios, the bit is lost (e.g. md raid5).

> dm_init_request_based_queue (on the most current upstream dm.c):
> blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
> (your current patch only sets REQ_FLUSH in alloc_dev).

Yeah, but for that direction, just adding REQ_FUA to blk_queue_flush()
should be enough. I'll add it.

Thanks.

--
tejun

2010-08-17 18:27:12

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

On Tue, Aug 17 2010 at 12:51pm -0400,
Tejun Heo <[email protected]> wrote:

> Hello,
>
> On 08/17/2010 04:07 PM, Mike Snitzer wrote:
> >> With the patch applied, there's no second flush. Those requests would
> >> now be REQ_FLUSH + REQ_DISCARD. The first can't be avoided anyway and
> >> there won't be the second flush to begin with, so I don't think this
> >> worsens anything.
> >
> > Makes sense, but your patches still need to be refreshed against the
> > latest (2.6.36-rc1) upstream code. Numerous changes went into DM
> > recently.
>
> Sure thing. The block part isn't fixed yet and so the RFC tag. Once
> the block layer part is settled, it probably should be pulled into
> dm/md and other trees and conversions should happen there.

Why base your work on a partial 2.6.36 tree? Why not rebase to linus'
2.6.36-rc1?

Once we get the changes in a more suitable state (across the entire
tree) we can split the individual changes out to their respective
trees. Without a comprehensive tree I fear this code isn't going to get
tested or reviewed properly.

For example: any review of DM changes, against stale DM code, is wasted
effort.

> >> * For request based dm:
> >>
> >> * The sequencing is done by the block layer for the top level
> >> request_queue, so the only things request based dm needs to make
> >> sure is 1. handling empty REQ_FLUSH correctly (block layer will
> >> only send down empty REQ_FLUSHes) and 2. propagate REQ_FUA bit to
> >> member devices.
> >
> > OK, so seems 1 is done, 2 is still TODO. Looking at your tree it seems
> > 2 would be as simple as using the following in
>
> Oh, I was talking about the other way around. Passing REQ_FUA in
> bio->bi_rw down to member request_queues. Sometimes while
> constructing clone / split bios, the bit is lost (e.g. md raid5).

Seems we need to change __blk_rq_prep_clone to propagate REQ_FUA just
like REQ_DISCARD: http://git.kernel.org/linus/3383977

Doesn't seem like we need to do the same for REQ_FLUSH (at least not for
rq-based DM's benefit) because dm.c:setup_clone already special cases
flush requests and sets REQ_FLUSH in cmd_flags.

Mike

2010-08-18 06:33:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

Hello,

On 08/17/2010 08:21 PM, Mike Snitzer wrote:
> Why base your work on a partial 2.6.36 tree? Why not rebase to linus'
> 2.6.36-rc1?

Because the block tree contains enough changes which are not in
mainline yet and bulk of the changes should go through the block tree.

> Once we get the changes in a more suitable state (across the entire
> tree) we can split the individual changes out to their respective
> trees. Without a comprehensive tree I fear this code isn't going to get
> tested or reviewed properly.
>
> For example: any review of DM changes, against stale DM code, is wasted
> effort.

Yeap, sure, it will all happen in good time, but I don't really
agree that review against the block tree would be a complete waste of
effort. Things usually don't change that drastically unless dm is
being rewritten as we speak. Anyways, I'll soon post a rebased /
updated patch.

>> Oh, I was talking about the other way around. Passing REQ_FUA in
>> bio->bi_rw down to member request_queues. Sometimes while
>> constructing clone / split bios, the bit is lost (e.g. md raid5).
>
> Seems we need to change __blk_rq_prep_clone to propagate REQ_FUA just
> like REQ_DISCARD: http://git.kernel.org/linus/3383977

Oooh, right. For some reason I thought the block layer was already doing
that. I'll update it. Thanks for pointing it out.

> Doesn't seem like we need to do the same for REQ_FLUSH (at least not for
> rq-based DM's benefit) because dm.c:setup_clone already special cases
> flush requests and sets REQ_FLUSH in cmd_flags.

Hmmm... but in general, I think it would be better to make
__blk_rq_prep_clone() copy all command-related bits. If some
command bits shouldn't be copied, the caller should take care of them.
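
Something like the following, as an illustration of that direction (a
sketch only, not the actual blk-core.c code; the helper name and the
exact set of copied bits are made up, against the 2.6.36-era flag
definitions):

/*
 * Copy the command-related bits wholesale when preparing a clone
 * instead of special-casing REQ_DISCARD, REQ_FUA and friends one by one.
 */
static void rq_clone_cmd_bits(struct request *dst, struct request *src)
{
	dst->cmd_flags = rq_data_dir(src) | REQ_NOMERGE;
	dst->cmd_flags |= src->cmd_flags &
			  (REQ_DISCARD | REQ_FLUSH | REQ_FUA);
	dst->cmd_type = src->cmd_type;
}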

Thanks.

--
tejun

2010-08-18 10:22:10

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 2/5] virtio_blk: implement REQ_FLUSH/FUA support

On Tue, 17 Aug 2010 10:53:27 pm Christoph Hellwig wrote:
> Given how little testing lguest gets compared to qemu I really don't
> want a protocol addition for it unless it really buys us something.
> Once we're done with this barrier conversion I plan on benchmarking
> FUA and a pre-flush tag on the command for virtio in real-life setups,
> and seeing if it actually buys us anything.

Absolutely. Lguest should follow, not lead!

Rusty.

2010-08-19 10:33:09

by Kiyoshi Ueda

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

Hi Tejun, Mike,

On 08/18/2010 01:51 AM +0900, Tejun Heo wrote:
> On 08/17/2010 04:07 PM, Mike Snitzer wrote:
>> NOTE: NEC has already given some preliminary feedback to hch in the
>> "[PATCH, RFC 2/2] dm: support REQ_FLUSH directly" thread:
>> https://www.redhat.com/archives/dm-devel/2010-August/msg00026.html
>> https://www.redhat.com/archives/dm-devel/2010-August/msg00033.html
>
> Hmmm... I think both issues don't exist in this incarnation of
> conversion although I'm fairly sure there will be other issues. :-)

The same issue is still there for request-based dm. See below.


>>> A related question: Is dm_wait_for_completion() used in
>>> process_flush() safe against starvation under continuous influx of
>>> other commands?
>> As for your specific dm_wait_for_completion() concern -- I'll defer to
>> Mikulas. But I'll add: we haven't had any reported starvation issues
>> with DM's existing barrier support. DM uses a mempool for its clones,
>> so it should naturally throttle (without starvation) when memory gets
>> low.
>
> I see, but a single pending flush and steady write streams w/o saturating
> the mempool would be able to stall dm_wait_for_completion(), no? Eh
> well, it's a separate issue, I guess.

Your understanding is correct, dm_wait_for_completion() for flush
will stall in such cases for request-based dm.
That's why I mentioned below in
https://www.redhat.com/archives/dm-devel/2010-August/msg00026.html.

In other words, current request-based device-mapper can't handle
other requests while a flush request is in progress.

In flush request handling, request-based dm uses dm_wait_for_completion()
to wait for the completion of cloned flush requests, depending on
the fact that there should be only flush requests in flight owning
to the block layer sequencing.

It's not a separate issue and we need to resolve it at least.
I'm still considering how I can fix the request-based dm.

Thanks,
Kiyoshi Ueda

2010-08-19 15:18:43

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/5 UPDATED] virtio_blk: drop REQ_HARDBARRIER support

Remove now unused REQ_HARDBARRIER support. virtio_blk already
supports REQ_FLUSH and the usefulness of REQ_FUA for virtio_blk is
questionable at this point, so there's nothing else to do to support
new REQ_FLUSH/FUA interface.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
REQ_FUA support dropped as suggested by Christoph.

drivers/block/virtio_blk.c | 17 ++++-------------
1 file changed, 4 insertions(+), 13 deletions(-)

Index: block/drivers/block/virtio_blk.c
===================================================================
--- block.orig/drivers/block/virtio_blk.c
+++ block/drivers/block/virtio_blk.c
@@ -128,9 +128,6 @@ static bool do_req(struct request_queue
}
}

- if (vbr->req->cmd_flags & REQ_HARDBARRIER)
- vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-
sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));

/*
@@ -388,13 +385,7 @@ static int __devinit virtblk_probe(struc
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that to
- * implement write barrier support; otherwise, we must assume
- * that the host does not perform any kind of volatile write
- * caching.
- */
+ /* configure queue flush support */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
blk_queue_flush(q, REQ_FLUSH);

@@ -515,9 +506,9 @@ static const struct virtio_device_id id_
};

static unsigned int features[] = {
- VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
- VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
- VIRTIO_BLK_F_SCSI, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
+ VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+ VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
+ VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
};

/*

2010-08-19 15:19:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/5] lguest: replace VIRTIO_F_BARRIER support with VIRTIO_F_FLUSH support

VIRTIO_F_BARRIER is deprecated. Replace it with VIRTIO_F_FLUSH
support.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
FUA support dropped as suggested by Christoph. Rusty, can you please
ack this version too? I tested it with the updated virtio_blk and it
works fine.

Thanks.

Documentation/lguest/lguest.c | 29 +++++++++--------------------
1 file changed, 9 insertions(+), 20 deletions(-)

Index: block/Documentation/lguest/lguest.c
===================================================================
--- block.orig/Documentation/lguest/lguest.c
+++ block/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue
off = out->sector * 512;

/*
- * The block device implements "barriers", where the Guest indicates
- * that it wants all previous writes to occur before this write. We
- * don't have a way of asking our kernel to do a barrier, so we just
- * synchronize all the data in the file. Pretty poor, no?
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
- /*
* In general the virtio block driver is allowed to try SCSI commands.
* It'd be nice if we supported eject, for example, but we don't.
*/
@@ -1679,6 +1670,13 @@ static void blk_request(struct virtqueue
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, ret);
}
+
+ wlen = sizeof(*in);
+ *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ } else if (out->type & VIRTIO_BLK_T_FLUSH) {
+ /* Flush */
+ ret = fdatasync(vblk->fd);
+ verbose("FLUSH fdatasync: %i\n", ret);
wlen = sizeof(*in);
*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
} else {
@@ -1702,15 +1700,6 @@ static void blk_request(struct virtqueue
}
}

- /*
- * OK, so we noted that it was pretty poor to use an fdatasync as a
- * barrier. But Christoph Hellwig points out that we need a sync
- * *afterwards* as well: "Barriers specify no reordering to the front
- * or the back." And Jens Axboe confirmed it, so here we are:
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
/* Finished that request. */
add_used(vq, head, wlen);
}
@@ -1735,8 +1724,8 @@ static void setup_block_file(const char
vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
vblk->len = lseek64(vblk->fd, 0, SEEK_END);

- /* We support barriers. */
- add_feature(dev, VIRTIO_BLK_F_BARRIER);
+ /* We support FLUSH. */
+ add_feature(dev, VIRTIO_BLK_F_FLUSH);

/* Tell Guest how many sectors this device has. */
conf.capacity = cpu_to_le64(vblk->len / 512);

2010-08-19 15:50:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/5] dm: implement REQ_FLUSH/FUA support

Hello,

On 08/19/2010 12:32 PM, Kiyoshi Ueda wrote:
>> I see, but a single pending flush and steady write streams w/o saturating
>> the mempool would be able to stall dm_wait_for_completion(), no? Eh
>> well, it's a separate issue, I guess.
>
> Your understanding is correct, dm_wait_for_completion() for flush
> will stall in such cases for request-based dm.
> That's why I mentioned below in
> https://www.redhat.com/archives/dm-devel/2010-August/msg00026.html.
>
> In other words, current request-based device-mapper can't handle
> other requests while a flush request is in progress.
>
> In flush request handling, request-based dm uses dm_wait_for_completion()
> to wait for the completion of cloned flush requests, depending on
> the fact that there should be only flush requests in flight owing
> to the block layer sequencing.

I see. The bio-based implementation also uses dm_wait_for_completion(),
but it also has DMF_QUEUE_IO_TO_THREAD to plug all the follow-up bios
while a flush is in progress, which sucks for throughput but
successfully avoids starvation.

> It's not a separate issue and we need to resolve it at least.
> I'm still considering how I can fix the request-based dm.

Right, I thought you were talking about REQ_FLUSHes not synchronized
against a barrier write. Anyways, yeah, it's a problem. I don't think
not being able to handle multiple flushes concurrently would be a
major issue. The problem is not being able to process other
bios/requests while a flush is in progress. All that's necessary is
making the completion detection a bit more fine-grained so that it
counts the number of in-flight flush bios/requests and completes when
it reaches zero instead of waiting for all outstanding commands.
Shouldn't be too hard.
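
Something along these lines, just to illustrate (the field and function
names are invented, not dm code, and it's shown with bios for brevity --
the request-based side would count clone requests the same way):

/* count only flush clones; unrelated in-flight I/O is ignored */
static void flush_clone_end_io(struct bio *clone, int error)
{
	/* bi_private is assumed to point back at the mapped_device */
	struct mapped_device *md = clone->bi_private;

	bio_put(clone);
	/* flush_pending / flush_wait are hypothetical fields */
	if (atomic_dec_and_test(&md->flush_pending))
		wake_up(&md->flush_wait);
}

static void wait_for_flush_clones(struct mapped_device *md)
{
	/* waits only for the flush clones, not for all outstanding I/O */
	wait_event(md->flush_wait, !atomic_read(&md->flush_pending));
}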

Thanks.

--
tejun

2010-08-23 16:47:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCHSET block#for-2.6.36-post] block: convert to REQ_FLUSH/FUA

Philipp and Lars,

can you look at the changes in

http://www.spinics.net/lists/linux-fsdevel/msg35884.html

and

http://www.spinics.net/lists/linux-fsdevel/msg36001.html

and see how drbd can be adapted to it? It's one of two drivers still
missing, and because it's not quite intuitive what it's doing it's
very hard for Tejun and me to help.

2010-08-24 05:41:24

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 4/5] md: implment REQ_FLUSH/FUA support

On Mon, 16 Aug 2010 18:52:02 +0200
Tejun Heo <[email protected]> wrote:


Hi Tejun,
thanks for doing this.
It mostly looks good, especially ...


> * REQ_FLUSH/FUA failures are final and its users don't need retry
> logic. Retry logic is removed.

This bit - all that retry logic felt so clumsy :-)

Only change I would make is:

>
> @@ -4083,7 +4089,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
> finish_wait(&conf->wait_for_overlap, &w);
> set_bit(STRIPE_HANDLE, &sh->state);
> clear_bit(STRIPE_DELAYED, &sh->state);
> - if (mddev->barrier &&
> + if (mddev->flush_bio &&
> !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> atomic_inc(&conf->preread_active_stripes);
> release_stripe(sh);
> @@ -4106,7 +4112,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
> bio_endio(bi, 0);
> }
>
> - if (mddev->barrier) {
> + if (mddev->flush_bio) {
> /* We need to wait for the stripes to all be handled.
> * So: wait for preread_active_stripes to drop to 0.
> */

These two in raid5.c aren't quite right.
The first should be changed to test
bi->bi_rw & REQ_SYNC
rather than
mddev->flush_bio.
(Assuming REQ_SYNC means "don't bother waiting for more requests that
might combine with this one to make it all go faster", which I think it does.)

For the second we can just drop the whole if statement.
It was needed so that all the writes would go down to the underlying
devices so that the null-barrier which would subsequently be passed to all
those devices would go *after* the writes for the barrier request.
As there is no longer a post-flush, that code can go.
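
In other words, roughly (untested sketch against the hunks quoted
above):

-		if (mddev->flush_bio &&
+		if ((bi->bi_rw & REQ_SYNC) &&
 		    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 			atomic_inc(&conf->preread_active_stripes);

plus deleting the whole second "if (mddev->flush_bio)" block from
make_request().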

Thanks a lot, and sorry for the delay in reviewing it.
NeilBrown


> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 0f86f5e..ff9cad2 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -275,6 +275,7 @@ struct r6_state {
> * filling
> */
> #define R5_Wantdrain 13 /* dev->towrite needs to be drained */
> +#define R5_WantFUA 14 /* Write should be FUA */
> /*
> * Write method
> */

2010-08-24 09:51:42

by Lars Ellenberg

[permalink] [raw]
Subject: Re: [RFC PATCHSET block#for-2.6.36-post] block: convert to REQ_FLUSH/FUA

On Mon, Aug 23, 2010 at 06:47:09PM +0200, Christoph Hellwig wrote:
> Philipp and Lars,
>
> can you look at the changes in
>
> http://www.spinics.net/lists/linux-fsdevel/msg35884.html
>
> and
>
> http://www.spinics.net/lists/linux-fsdevel/msg36001.html
>
> and see how drbd can be adapted to it?

Yes of course.
I have been closely following the discussion related to this patchset.

> It's one of two drivers still
> missing, and because it's not quite intuitive what it's doing it's
> very hard for Tejun and me to help.

I'm currently on vacation, actually, and Phil is likely very busy,
but we should have a patch ready within the next few days anyway.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.

2010-08-24 15:46:15

by Philipp Reisner

[permalink] [raw]
Subject: Re: [RFC PATCHSET block#for-2.6.36-post] block: convert to REQ_FLUSH/FUA

Am Montag, 23. August 2010, um 18:47:09 schrieb Christoph Hellwig:
> Philipp and Lars,
>
> can you look at the changes in
>
> http://www.spinics.net/lists/linux-fsdevel/msg35884.html
>
> and
>
> http://www.spinics.net/lists/linux-fsdevel/msg36001.html
>
> and see how drbd can be adapted to it? It's one of two drivers still
> missing, and because it's not quite intuitive what it's doing it's
> very hard for Tejun and me to help.

Hi Christoph,

I was able to finish the greater part of the necessary work today.
You can find it here:

git://git.drbd.org/linux-2.6-drbd.git flush-fua

Of course that is based on

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua

REQ_HARDBARRIER is not yet completely eliminated from drbd, but I
am very confident that I will be able to finish that by tomorrow.

Best,
Phil
--
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.

2010-08-25 11:28:21

by Tejun Heo

[permalink] [raw]
Subject: [PATCH UPDATED 4/5] md: implment REQ_FLUSH/FUA support

This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and its users don't need retry
logic. Retry logic is removed.

* Preflush needs to be issued to all member devices but FUA writes can
be handled the same way as other writes - their processing can be
deferred to request_queue of member devices. md_barrier_request()
is renamed to md_flush_request() and simplified accordingly.

For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.

* raid5: Queue draining logic dropped. FUA bit is propagated through
biodrain and stripe reconstruction such that all the updated parts
of the stripe are written out with FUA writes if any of the dirtying
writes was FUA. preread_active_stripes handling in make_request()
is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, 1, 5 and 10 tested.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Neil Brown <[email protected]>
---
Rebased on top of -rc2 and updated as suggested. Can you please
review and ack it?

Thanks.

drivers/md/linear.c | 4 -
drivers/md/md.c | 117 +++++++--------------------------
drivers/md/md.h | 23 +-----
drivers/md/multipath.c | 4 -
drivers/md/raid0.c | 4 -
drivers/md/raid1.c | 173 ++++++++++++++++---------------------------------
drivers/md/raid1.h | 2
drivers/md/raid10.c | 7 +
drivers/md/raid5.c | 43 +++++-------
drivers/md/raid5.h | 1
10 files changed, 121 insertions(+), 257 deletions(-)

Index: block/drivers/md/linear.c
===================================================================
--- block.orig/drivers/md/linear.c
+++ block/drivers/md/linear.c
@@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
dev_info_t *tmp_dev;
sector_t start_sector;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/md.c
===================================================================
--- block.orig/drivers/md/md.c
+++ block/drivers/md/md.c
@@ -226,12 +226,12 @@ static int md_make_request(struct reques
return 0;
}
rcu_read_lock();
- if (mddev->suspended || mddev->barrier) {
+ if (mddev->suspended) {
DEFINE_WAIT(__wait);
for (;;) {
prepare_to_wait(&mddev->sb_wait, &__wait,
TASK_UNINTERRUPTIBLE);
- if (!mddev->suspended && !mddev->barrier)
+ if (!mddev->suspended)
break;
rcu_read_unlock();
schedule();
@@ -282,40 +282,29 @@ EXPORT_SYMBOL_GPL(mddev_resume);

int mddev_congested(mddev_t *mddev, int bits)
{
- if (mddev->barrier)
- return 1;
return mddev->suspended;
}
EXPORT_SYMBOL(mddev_congested);

/*
- * Generic barrier handling for md
+ * Generic flush handling for md
*/

-#define POST_REQUEST_BARRIER ((void*)1)
-
-static void md_end_barrier(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err)
{
mdk_rdev_t *rdev = bio->bi_private;
mddev_t *mddev = rdev->mddev;
- if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
- set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);

rdev_dec_pending(rdev, mddev);

if (atomic_dec_and_test(&mddev->flush_pending)) {
- if (mddev->barrier == POST_REQUEST_BARRIER) {
- /* This was a post-request barrier */
- mddev->barrier = NULL;
- wake_up(&mddev->sb_wait);
- } else
- /* The pre-request barrier has finished */
- schedule_work(&mddev->barrier_work);
+ /* The pre-request flush has finished */
+ schedule_work(&mddev->flush_work);
}
bio_put(bio);
}

-static void submit_barriers(mddev_t *mddev)
+static void submit_flushes(mddev_t *mddev)
{
mdk_rdev_t *rdev;

@@ -332,60 +321,56 @@ static void submit_barriers(mddev_t *mdd
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
bi = bio_alloc(GFP_KERNEL, 0);
- bi->bi_end_io = md_end_barrier;
+ bi->bi_end_io = md_end_flush;
bi->bi_private = rdev;
bi->bi_bdev = rdev->bdev;
atomic_inc(&mddev->flush_pending);
- submit_bio(WRITE_BARRIER, bi);
+ submit_bio(WRITE_FLUSH, bi);
rcu_read_lock();
rdev_dec_pending(rdev, mddev);
}
rcu_read_unlock();
}

-static void md_submit_barrier(struct work_struct *ws)
+static void md_submit_flush_data(struct work_struct *ws)
{
- mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
- struct bio *bio = mddev->barrier;
+ mddev_t *mddev = container_of(ws, mddev_t, flush_work);
+ struct bio *bio = mddev->flush_bio;

atomic_set(&mddev->flush_pending, 1);

- if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
- bio_endio(bio, -EOPNOTSUPP);
- else if (bio->bi_size == 0)
+ if (bio->bi_size == 0)
/* an empty barrier - all done */
bio_endio(bio, 0);
else {
- bio->bi_rw &= ~REQ_HARDBARRIER;
+ bio->bi_rw &= ~REQ_FLUSH;
if (mddev->pers->make_request(mddev, bio))
generic_make_request(bio);
- mddev->barrier = POST_REQUEST_BARRIER;
- submit_barriers(mddev);
}
if (atomic_dec_and_test(&mddev->flush_pending)) {
- mddev->barrier = NULL;
+ mddev->flush_bio = NULL;
wake_up(&mddev->sb_wait);
}
}

-void md_barrier_request(mddev_t *mddev, struct bio *bio)
+void md_flush_request(mddev_t *mddev, struct bio *bio)
{
spin_lock_irq(&mddev->write_lock);
wait_event_lock_irq(mddev->sb_wait,
- !mddev->barrier,
+ !mddev->flush_bio,
mddev->write_lock, /*nothing*/);
- mddev->barrier = bio;
+ mddev->flush_bio = bio;
spin_unlock_irq(&mddev->write_lock);

atomic_set(&mddev->flush_pending, 1);
- INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+ INIT_WORK(&mddev->flush_work, md_submit_flush_data);

- submit_barriers(mddev);
+ submit_flushes(mddev);

if (atomic_dec_and_test(&mddev->flush_pending))
- schedule_work(&mddev->barrier_work);
+ schedule_work(&mddev->flush_work);
}
-EXPORT_SYMBOL(md_barrier_request);
+EXPORT_SYMBOL(md_flush_request);

/* Support for plugging.
* This mirrors the plugging support in request_queue, but does not
@@ -696,31 +681,6 @@ static void super_written(struct bio *bi
bio_put(bio);
}

-static void super_written_barrier(struct bio *bio, int error)
-{
- struct bio *bio2 = bio->bi_private;
- mdk_rdev_t *rdev = bio2->bi_private;
- mddev_t *mddev = rdev->mddev;
-
- if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
- error == -EOPNOTSUPP) {
- unsigned long flags;
- /* barriers don't appear to be supported :-( */
- set_bit(BarriersNotsupp, &rdev->flags);
- mddev->barriers_work = 0;
- spin_lock_irqsave(&mddev->write_lock, flags);
- bio2->bi_next = mddev->biolist;
- mddev->biolist = bio2;
- spin_unlock_irqrestore(&mddev->write_lock, flags);
- wake_up(&mddev->sb_wait);
- bio_put(bio);
- } else {
- bio_put(bio2);
- bio->bi_private = rdev;
- super_written(bio, error);
- }
-}
-
void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page)
{
@@ -729,51 +689,28 @@ void md_super_write(mddev_t *mddev, mdk_
* and decrement it on completion, waking up sb_wait
* if zero is reached.
* If an error occurred, call md_error
- *
- * As we might need to resubmit the request if REQ_HARDBARRIER
- * causes ENOTSUPP, we allocate a spare bio...
*/
struct bio *bio = bio_alloc(GFP_NOIO, 1);
- int rw = REQ_WRITE | REQ_SYNC | REQ_UNPLUG;

bio->bi_bdev = rdev->bdev;
bio->bi_sector = sector;
bio_add_page(bio, page, size, 0);
bio->bi_private = rdev;
bio->bi_end_io = super_written;
- bio->bi_rw = rw;

atomic_inc(&mddev->pending_writes);
- if (!test_bit(BarriersNotsupp, &rdev->flags)) {
- struct bio *rbio;
- rw |= REQ_HARDBARRIER;
- rbio = bio_clone(bio, GFP_NOIO);
- rbio->bi_private = bio;
- rbio->bi_end_io = super_written_barrier;
- submit_bio(rw, rbio);
- } else
- submit_bio(rw, bio);
+ submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
+ bio);
}

void md_super_wait(mddev_t *mddev)
{
- /* wait for all superblock writes that were scheduled to complete.
- * if any had to be retried (due to BARRIER problems), retry them
- */
+ /* wait for all superblock writes that were scheduled to complete */
DEFINE_WAIT(wq);
for(;;) {
prepare_to_wait(&mddev->sb_wait, &wq, TASK_UNINTERRUPTIBLE);
if (atomic_read(&mddev->pending_writes)==0)
break;
- while (mddev->biolist) {
- struct bio *bio;
- spin_lock_irq(&mddev->write_lock);
- bio = mddev->biolist;
- mddev->biolist = bio->bi_next ;
- bio->bi_next = NULL;
- spin_unlock_irq(&mddev->write_lock);
- submit_bio(bio->bi_rw, bio);
- }
schedule();
}
finish_wait(&mddev->sb_wait, &wq);
@@ -1070,7 +1007,6 @@ static int super_90_validate(mddev_t *md
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 0;
@@ -1485,7 +1421,6 @@ static int super_1_validate(mddev_t *mdd
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 1;
@@ -4506,7 +4441,6 @@ int md_run(mddev_t *mddev)
/* may be over-ridden by personality */
mddev->resync_max_sectors = mddev->dev_sectors;

- mddev->barriers_work = 1;
mddev->ok_start_degraded = start_dirty_degraded;

if (start_readonly && mddev->ro == 0)
@@ -4685,7 +4619,6 @@ static void md_clean(mddev_t *mddev)
mddev->recovery = 0;
mddev->in_sync = 0;
mddev->degraded = 0;
- mddev->barriers_work = 0;
mddev->safemode = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
Index: block/drivers/md/md.h
===================================================================
--- block.orig/drivers/md/md.h
+++ block/drivers/md/md.h
@@ -87,7 +87,6 @@ struct mdk_rdev_s
#define Faulty 1 /* device is known to have a fault */
#define In_sync 2 /* device is in_sync with rest of array */
#define WriteMostly 4 /* Avoid reading if at all possible */
-#define BarriersNotsupp 5 /* REQ_HARDBARRIER is not supported */
#define AllReserved 6 /* If whole device is reserved for
* one array */
#define AutoDetected 7 /* added by auto-detect */
@@ -273,13 +272,6 @@ struct mddev_s
int degraded; /* whether md should consider
* adding a spare
*/
- int barriers_work; /* initialised to true, cleared as soon
- * as a barrier request to slave
- * fails. Only supported
- */
- struct bio *biolist; /* bios that need to be retried
- * because REQ_HARDBARRIER is not supported
- */

atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
@@ -339,16 +331,13 @@ struct mddev_s
struct attribute_group *to_remove;
struct plug_handle *plug; /* if used by personality */

- /* Generic barrier handling.
- * If there is a pending barrier request, all other
- * writes are blocked while the devices are flushed.
- * The last to finish a flush schedules a worker to
- * submit the barrier request (without the barrier flag),
- * then submit more flush requests.
+ /* Generic flush handling.
+ * The last to finish preflush schedules a worker to submit
+ * the rest of the request (without the REQ_FLUSH flag).
*/
- struct bio *barrier;
+ struct bio *flush_bio;
atomic_t flush_pending;
- struct work_struct barrier_work;
+ struct work_struct flush_work;
struct work_struct event_work; /* used by dm to report failure event */
};

@@ -502,7 +491,7 @@ extern void md_done_sync(mddev_t *mddev,
extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);

extern int mddev_congested(mddev_t *mddev, int bits);
-extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
+extern void md_flush_request(mddev_t *mddev, struct bio *bio);
extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page);
extern void md_super_wait(mddev_t *mddev);
Index: block/drivers/md/raid0.c
===================================================================
--- block.orig/drivers/md/raid0.c
+++ block/drivers/md/raid0.c
@@ -483,8 +483,8 @@ static int raid0_make_request(mddev_t *m
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/raid1.c
===================================================================
--- block.orig/drivers/md/raid1.c
+++ block/drivers/md/raid1.c
@@ -319,83 +319,74 @@ static void raid1_end_write_request(stru
if (r1_bio->bios[mirror] == bio)
break;

- if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
- set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
- set_bit(R1BIO_BarrierRetry, &r1_bio->state);
- r1_bio->mddev->barriers_work = 0;
- /* Don't rdev_dec_pending in this branch - keep it for the retry */
- } else {
+ /*
+ * 'one mirror IO has finished' event handler:
+ */
+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
+ if (!uptodate) {
+ md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
+ /* an I/O failed, we can't clear the bitmap */
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ } else
/*
- * this branch is our 'one mirror IO has finished' event handler:
+ * Set R1BIO_Uptodate in our master bio, so that we
+ * will return a good error code for to the higher
+ * levels even if IO on some other mirrored buffer
+ * fails.
+ *
+ * The 'master' represents the composite IO operation
+ * to user-side. So if something waits for IO, then it
+ * will wait for the 'master' bio.
*/
- r1_bio->bios[mirror] = NULL;
- to_put = bio;
- if (!uptodate) {
- md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R1BIO_Degraded, &r1_bio->state);
- } else
- /*
- * Set R1BIO_Uptodate in our master bio, so that
- * we will return a good error code for to the higher
- * levels even if IO on some other mirrored buffer fails.
- *
- * The 'master' represents the composite IO operation to
- * user-side. So if something waits for IO, then it will
- * wait for the 'master' bio.
- */
- set_bit(R1BIO_Uptodate, &r1_bio->state);
+ set_bit(R1BIO_Uptodate, &r1_bio->state);
+
+ update_head_pos(mirror, r1_bio);

- update_head_pos(mirror, r1_bio);
+ if (behind) {
+ if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
+ atomic_dec(&r1_bio->behind_remaining);

- if (behind) {
- if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
- atomic_dec(&r1_bio->behind_remaining);
-
- /* In behind mode, we ACK the master bio once the I/O has safely
- * reached all non-writemostly disks. Setting the Returned bit
- * ensures that this gets done only once -- we don't ever want to
- * return -EIO here, instead we'll wait */
-
- if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
- test_bit(R1BIO_Uptodate, &r1_bio->state)) {
- /* Maybe we can return now */
- if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
- struct bio *mbio = r1_bio->master_bio;
- PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
- (unsigned long long) mbio->bi_sector,
- (unsigned long long) mbio->bi_sector +
- (mbio->bi_size >> 9) - 1);
- bio_endio(mbio, 0);
- }
+ /*
+ * In behind mode, we ACK the master bio once the I/O
+ * has safely reached all non-writemostly
+ * disks. Setting the Returned bit ensures that this
+ * gets done only once -- we don't ever want to return
+ * -EIO here, instead we'll wait
+ */
+ if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
+ test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+ /* Maybe we can return now */
+ if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+ struct bio *mbio = r1_bio->master_bio;
+ PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
+ (unsigned long long) mbio->bi_sector,
+ (unsigned long long) mbio->bi_sector +
+ (mbio->bi_size >> 9) - 1);
+ bio_endio(mbio, 0);
}
}
- rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
}
+ rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+
/*
- *
* Let's see if all mirrored write operations have finished
* already.
*/
if (atomic_dec_and_test(&r1_bio->remaining)) {
- if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
- reschedule_retry(r1_bio);
- else {
- /* it really is the end of this request */
- if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
- /* free extra copy of the data pages */
- int i = bio->bi_vcnt;
- while (i--)
- safe_put_page(bio->bi_io_vec[i].bv_page);
- }
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
- r1_bio->sectors,
- !test_bit(R1BIO_Degraded, &r1_bio->state),
- behind);
- md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
- }
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ /* free extra copy of the data pages */
+ int i = bio->bi_vcnt;
+ while (i--)
+ safe_put_page(bio->bi_io_vec[i].bv_page);
+ }
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ !test_bit(R1BIO_Degraded, &r1_bio->state),
+ behind);
+ md_write_end(r1_bio->mddev);
+ raid_end_bio_io(r1_bio);
}

if (to_put)
@@ -788,6 +779,7 @@ static int make_request(mddev_t *mddev,
struct page **behind_pages = NULL;
const int rw = bio_data_dir(bio);
const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned long do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
unsigned long do_barriers;
mdk_rdev_t *blocked_rdev;

@@ -795,9 +787,6 @@ static int make_request(mddev_t *mddev,
* Register the new request and wait if the reconstruction
* thread has put up a bar for new requests.
* Continue immediately if no resync is active currently.
- * We test barriers_work *after* md_write_start as md_write_start
- * may cause the first superblock write, and that will check out
- * if barriers work.
*/

md_write_start(mddev, bio); /* wait on superblock update early */
@@ -821,13 +810,6 @@ static int make_request(mddev_t *mddev,
}
finish_wait(&conf->wait_barrier, &w);
}
- if (unlikely(!mddev->barriers_work &&
- (bio->bi_rw & REQ_HARDBARRIER))) {
- if (rw == WRITE)
- md_write_end(mddev);
- bio_endio(bio, -EOPNOTSUPP);
- return 0;
- }

wait_barrier(conf);

@@ -959,10 +941,6 @@ static int make_request(mddev_t *mddev,
atomic_set(&r1_bio->remaining, 0);
atomic_set(&r1_bio->behind_remaining, 0);

- do_barriers = bio->bi_rw & REQ_HARDBARRIER;
- if (do_barriers)
- set_bit(R1BIO_Barrier, &r1_bio->state);
-
bio_list_init(&bl);
for (i = 0; i < disks; i++) {
struct bio *mbio;
@@ -975,7 +953,7 @@ static int make_request(mddev_t *mddev,
mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
mbio->bi_end_io = raid1_end_write_request;
- mbio->bi_rw = WRITE | do_barriers | do_sync;
+ mbio->bi_rw = WRITE | do_flush_fua | do_sync;
mbio->bi_private = r1_bio;

if (behind_pages) {
@@ -1634,41 +1612,6 @@ static void raid1d(mddev_t *mddev)
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
sync_request_write(mddev, r1_bio);
unplug = 1;
- } else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
- /* some requests in the r1bio were REQ_HARDBARRIER
- * requests which failed with -EOPNOTSUPP. Hohumm..
- * Better resubmit without the barrier.
- * We know which devices to resubmit for, because
- * all others have had their bios[] entry cleared.
- * We already have a nr_pending reference on these rdevs.
- */
- int i;
- const unsigned long do_sync = (r1_bio->master_bio->bi_rw & REQ_SYNC);
- clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
- clear_bit(R1BIO_Barrier, &r1_bio->state);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i])
- atomic_inc(&r1_bio->remaining);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i]) {
- struct bio_vec *bvec;
- int j;
-
- bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
- /* copy pages from the failed bio, as
- * this might be a write-behind device */
- __bio_for_each_segment(bvec, bio, j, 0)
- bvec->bv_page = bio_iovec_idx(r1_bio->bios[i], j)->bv_page;
- bio_put(r1_bio->bios[i]);
- bio->bi_sector = r1_bio->sector +
- conf->mirrors[i].rdev->data_offset;
- bio->bi_bdev = conf->mirrors[i].rdev->bdev;
- bio->bi_end_io = raid1_end_write_request;
- bio->bi_rw = WRITE | do_sync;
- bio->bi_private = r1_bio;
- r1_bio->bios[i] = bio;
- generic_make_request(bio);
- }
} else {
int disk;

Index: block/drivers/md/raid1.h
===================================================================
--- block.orig/drivers/md/raid1.h
+++ block/drivers/md/raid1.h
@@ -117,8 +117,6 @@ struct r1bio_s {
#define R1BIO_IsSync 1
#define R1BIO_Degraded 2
#define R1BIO_BehindIO 3
-#define R1BIO_Barrier 4
-#define R1BIO_BarrierRetry 5
/* For write-behind requests, we call bi_end_io when
* the last non-write-behind device completes, providing
* any write was successful. Otherwise we call when
Index: block/drivers/md/raid5.c
===================================================================
--- block.orig/drivers/md/raid5.c
+++ block/drivers/md/raid5.c
@@ -506,9 +506,12 @@ static void ops_run_io(struct stripe_hea
int rw;
struct bio *bi;
mdk_rdev_t *rdev;
- if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
- rw = WRITE;
- else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+ if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
+ if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags))
+ rw = WRITE_FUA;
+ else
+ rw = WRITE;
+ } else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
rw = READ;
else
continue;
@@ -1031,6 +1034,8 @@ ops_run_biodrain(struct stripe_head *sh,

while (wbi && wbi->bi_sector <
dev->sector + STRIPE_SECTORS) {
+ if (wbi->bi_rw & REQ_FUA)
+ set_bit(R5_WantFUA, &dev->flags);
tx = async_copy_data(1, wbi, dev->page,
dev->sector, tx);
wbi = r5_next_bio(wbi, dev->sector);
@@ -1048,15 +1053,22 @@ static void ops_complete_reconstruct(voi
int pd_idx = sh->pd_idx;
int qd_idx = sh->qd_idx;
int i;
+ bool fua = false;

pr_debug("%s: stripe %llu\n", __func__,
(unsigned long long)sh->sector);

+ for (i = disks; i--; )
+ fua |= test_bit(R5_WantFUA, &sh->dev[i].flags);
+
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];

- if (dev->written || i == pd_idx || i == qd_idx)
+ if (dev->written || i == pd_idx || i == qd_idx) {
set_bit(R5_UPTODATE, &dev->flags);
+ if (fua)
+ set_bit(R5_WantFUA, &dev->flags);
+ }
}

if (sh->reconstruct_state == reconstruct_state_drain_run)
@@ -3281,7 +3293,7 @@ static void handle_stripe5(struct stripe

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3583,7 +3595,7 @@ static void handle_stripe6(struct stripe

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3978,14 +3990,8 @@ static int make_request(mddev_t *mddev,
const int rw = bio_data_dir(bi);
int remaining;

- if (unlikely(bi->bi_rw & REQ_HARDBARRIER)) {
- /* Drain all pending writes. We only really need
- * to ensure they have been submitted, but this is
- * easier.
- */
- mddev->pers->quiesce(mddev, 1);
- mddev->pers->quiesce(mddev, 0);
- md_barrier_request(mddev, bi);
+ if (unlikely(bi->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bi);
return 0;
}

@@ -4103,7 +4109,7 @@ static int make_request(mddev_t *mddev,
finish_wait(&conf->wait_for_overlap, &w);
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
- if (mddev->barrier &&
+ if ((bi->bi_rw & REQ_SYNC) &&
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh);
@@ -4126,13 +4132,6 @@ static int make_request(mddev_t *mddev,
bio_endio(bi, 0);
}

- if (mddev->barrier) {
- /* We need to wait for the stripes to all be handled.
- * So: wait for preread_active_stripes to drop to 0.
- */
- wait_event(mddev->thread->wqueue,
- atomic_read(&conf->preread_active_stripes) == 0);
- }
return 0;
}

Index: block/drivers/md/multipath.c
===================================================================
--- block.orig/drivers/md/multipath.c
+++ block/drivers/md/multipath.c
@@ -142,8 +142,8 @@ static int multipath_make_request(mddev_
struct multipath_bh * mp_bh;
struct multipath_info *multipath;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/raid10.c
===================================================================
--- block.orig/drivers/md/raid10.c
+++ block/drivers/md/raid10.c
@@ -800,12 +800,13 @@ static int make_request(mddev_t *mddev,
int chunk_sects = conf->chunk_mask + 1;
const int rw = bio_data_dir(bio);
const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned long do_fua = (bio->bi_rw & REQ_FUA);
struct bio_list bl;
unsigned long flags;
mdk_rdev_t *blocked_rdev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

@@ -965,7 +966,7 @@ static int make_request(mddev_t *mddev,
conf->mirrors[d].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync;
+ mbio->bi_rw = WRITE | do_sync | do_fua;
mbio->bi_private = r10_bio;

atomic_inc(&r10_bio->remaining);
Index: block/drivers/md/raid5.h
===================================================================
--- block.orig/drivers/md/raid5.h
+++ block/drivers/md/raid5.h
@@ -275,6 +275,7 @@ struct r6_state {
* filling
*/
#define R5_Wantdrain 13 /* dev->towrite needs to be drained */
+#define R5_WantFUA 14 /* Write should be FUA */
/*
* Write method
*/
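
To see the converted paths end to end, here is a hedged example -- not part
of the patch, all names illustrative -- of a caller above md issuing a cache
flush followed by a FUA data write.  The empty REQ_FLUSH bio fans out to all
member devices via md_flush_request(); the REQ_FUA write is passed straight
to the members (raid1/raid10) or carried through the stripe cache as
R5_WantFUA (raid5).  Unlike REQ_HARDBARRIER, ordering between the two is the
caller's responsibility, hence the explicit waits:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

static void example_end_io(struct bio *bio, int error)
{
	complete(bio->bi_private);
	bio_put(bio);
}

static void example_flush_then_fua_write(struct block_device *bdev,
					 struct page *page, sector_t sector)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct bio *bio;

	/* 1. empty preflush: flush the write cache of every member */
	bio = bio_alloc(GFP_NOIO, 0);
	bio->bi_bdev = bdev;
	bio->bi_private = &done;
	bio->bi_end_io = example_end_io;
	submit_bio(WRITE | REQ_FLUSH, bio);
	wait_for_completion(&done);

	/* 2. FUA data write: durable without an array-wide postflush
	 * (error and bio_add_page() return-value handling omitted) */
	INIT_COMPLETION(done);
	bio = bio_alloc(GFP_NOIO, 1);
	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_private = &done;
	bio->bi_end_io = example_end_io;
	bio_add_page(bio, page, PAGE_SIZE, 0);
	submit_bio(WRITE | REQ_FUA, bio);
	wait_for_completion(&done);
}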

2010-08-25 11:43:08

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH UPDATED 4/5] md: implment REQ_FLUSH/FUA support

On Wed, 25 Aug 2010 13:22:26 +0200
Tejun Heo <[email protected]> wrote:

> This patch converts md to support REQ_FLUSH/FUA instead of now
> deprecated REQ_HARDBARRIER. In the core part (md.c), the following
> changes are notable.
>
> * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
> processing of other requests and thus there is no reason to mark the
> queue congested while FLUSH/FUA is in progress.
>
> * REQ_FLUSH/FUA failures are final and its users don't need retry
> logic. Retry logic is removed.
>
> * Preflush needs to be issued to all member devices but FUA writes can
> be handled the same way as other writes - their processing can be
> deferred to request_queue of member devices. md_barrier_request()
> is renamed to md_flush_request() and simplified accordingly.
>
> For linear, raid0 and multipath, the core changes are enough. raid1,
> 5 and 10 need the following conversions.
>
> * raid1: Handling of FLUSH/FUA bios can simply be deferred to
> request_queues of member devices. Barrier related logic removed.
>
> * raid5: Queue draining logic dropped. FUA bit is propagated through
>   biodrain and stripe reconstruction such that all the updated parts
> of the stripe are written out with FUA writes if any of the dirtying
> writes was FUA. preread_active_stripes handling in make_request()
> is updated as suggested by Neil Brown.
>
> * raid10: FUA bit needs to be propagated to write clones.
>
> linear, raid0, 1, 5 and 10 tested.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Neil Brown <[email protected]>
> ---
> Rebased on top of -rc2 and updated as suggested. Can you please
> review and ack it?

Reviewed-by: NeilBrown <[email protected]>

Thanks!

NeilBrown