2010-08-25 15:59:09

by Tejun Heo

Subject: [PATCHSET 2.6.36-rc2] block, fs: replace HARDBARRIER with FLUSH/FUA

Hello,

This patchset is a combination of the following three patchsets.

[1] block: replace barrier with sequenced flush
[2] block: convert to REQ_FLUSH/FUA
[3] replace barriers with explicit flush / FUA usage

Changes from the previous postings are,

* Rebased on top of 2.6.36-rc2 (502adf5778f4151dcba3f64dd6ed322151f3712c)

* Acked/Reviewed-by's added.

* ide-remove-unnecessary-blk_queue_flushing-test-in-do_ide_request
patch added, which removes blk_queue_flushing().

* BH flags, which are no longer necessary on 2.6.36-rc2, are dropped
from fs-block-propagate-REQ_FLUSH-FUA-interface-to-upper-layers and
the patch is collapsed into
block-implement-REQ_FLUSH-FUA-based-interface-for-FLUSH-FUA-requests.

* block-filter-flush-bio-s-in-__generic_make_request added. This
makes sure that make_request-based drivers which don't implement
cache flushes never see REQ_FLUSH/FUA requests (a rough sketch of
the filtering appears after this list).

* block-simplify-queue_next_fseq added.

* REQ_FUA support dropped from virtio/lguest conversion as suggested
by Christoph.

* md conversion updated as suggested by Neil Brown.

* dm conversion is excluded for now.

* block-remove-the-BLKDEV_IFL_BARRIER-flag patch now also removes
DISCARD_SECURE.

* block-remove-the-write-barrier-flag patch excluded for now (pending
the dm conversion).
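
For reference, the filtering mentioned above amounts to roughly the
following in __generic_make_request() (a sketch only; nr_sectors and
err are the locals already used by that function):

	/* strip REQ_FLUSH/FUA when the queue advertises no flush
	 * capability; an empty flush then has nothing left to do */
	if ((bio->bi_rw & (REQ_FLUSH | REQ_FUA)) && !q->flush_flags) {
		bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
		if (!nr_sectors) {
			err = 0;
			goto end_io;
		}
	}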

I've audited all make_request drivers and, after this patchset, only
blktrace, dm, drbd and xen need more work. I'll work on blktrace and
dm but leave xen and drbd for the respective maintainers.

Build tested w/ allmodconfig and lightly tested w/ ext4 and xfs.

This patchset contains the following thirty patches.

0001-ide-remove-unnecessary-blk_queue_flushing-test-in-do.patch
0002-block-loop-queue-ordered-mode-should-be-DRAIN_FLUSH.patch
0003-block-kill-QUEUE_ORDERED_BY_TAG.patch
0004-block-deprecate-barrier-and-replace-blk_queue_ordere.patch
0005-block-remove-spurious-uses-of-REQ_HARDBARRIER.patch
0006-block-misc-cleanups-in-barrier-code.patch
0007-block-drop-barrier-ordering-by-queue-draining.patch
0008-block-rename-blk-barrier.c-to-blk-flush.c.patch
0009-block-rename-barrier-ordered-to-flush.patch
0010-block-implement-REQ_FLUSH-FUA-based-interface-for-FL.patch
0011-block-filter-flush-bio-s-in-__generic_make_request.patch
0012-block-use-REQ_FLUSH-in-blkdev_issue_flush.patch
0013-block-simplify-queue_next_fseq.patch
0014-block-loop-implement-REQ_FLUSH-FUA-support.patch
0015-virtio_blk-drop-REQ_HARDBARRIER-support.patch
0016-lguest-replace-VIRTIO_F_BARRIER-support-with-VIRTIO_.patch
0017-md-implement-REQ_FLUSH-FUA-support.patch
0018-block-pass-gfp_mask-and-flags-to-sb_issue_discard.patch
0019-xfs-replace-barriers-with-explicit-flush-FUA-usage.patch
0020-btrfs-replace-barriers-with-explicit-flush-FUA-usage.patch
0021-gfs2-replace-barriers-with-explicit-flush-FUA-usage.patch
0022-reiserfs-replace-barriers-with-explicit-flush-FUA-us.patch
0023-nilfs2-replace-barriers-with-explicit-flush-FUA-usag.patch
0024-jbd-replace-barriers-with-explicit-flush-FUA-usage.patch
0025-jbd2-replace-barriers-with-explicit-flush-FUA-usage.patch
0026-ext4-do-not-send-discards-as-barriers.patch
0027-fat-do-not-send-discards-as-barriers.patch
0028-swap-do-not-send-discards-as-barriers.patch
0029-block-remove-the-BLKDEV_IFL_BARRIER-flag.patch
0030-block-remove-the-BH_Eopnotsupp-flag.patch

and is available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua

and contains the following changes.

Documentation/lguest/lguest.c | 29 --
block/Makefile | 2
block/blk-barrier.c | 350 ------------------------------------
block/blk-core.c | 68 +++---
block/blk-flush.c | 242 ++++++++++++++++++++++++
block/blk-lib.c | 18 -
block/blk-settings.c | 20 ++
block/blk.h | 8
block/elevator.c | 79 --------
drivers/block/brd.c | 1
drivers/block/loop.c | 20 +-
drivers/block/osdblk.c | 5
drivers/block/pktcdvd.c | 1
drivers/block/ps3disk.c | 2
drivers/block/virtio_blk.c | 37 ---
drivers/block/xen-blkfront.c | 47 +---
drivers/ide/ide-disk.c | 13 -
drivers/ide/ide-io.c | 13 -
drivers/md/dm.c | 2
drivers/md/linear.c | 4
drivers/md/md.c | 117 ++----------
drivers/md/md.h | 23 --
drivers/md/multipath.c | 4
drivers/md/raid0.c | 4
drivers/md/raid1.c | 175 ++++++------------
drivers/md/raid1.h | 2
drivers/md/raid10.c | 7
drivers/md/raid5.c | 43 ++--
drivers/md/raid5.h | 1
drivers/mmc/card/queue.c | 1
drivers/s390/block/dasd.c | 1
drivers/scsi/aic7xxx_old.c | 21 --
drivers/scsi/libsas/sas_scsi_host.c | 13 -
drivers/scsi/sd.c | 18 -
fs/btrfs/disk-io.c | 19 -
fs/btrfs/extent-tree.c | 2
fs/btrfs/volumes.c | 4
fs/btrfs/volumes.h | 1
fs/buffer.c | 7
fs/ext4/mballoc.c | 3
fs/fat/fatent.c | 4
fs/fat/misc.c | 5
fs/gfs2/log.c | 19 -
fs/gfs2/rgrp.c | 5
fs/jbd/commit.c | 30 ---
fs/jbd2/commit.c | 43 ----
fs/nilfs2/super.c | 10 -
fs/nilfs2/the_nilfs.c | 7
fs/reiserfs/journal.c | 106 ++--------
fs/xfs/linux-2.6/xfs_buf.c | 16 -
fs/xfs/linux-2.6/xfs_buf.h | 11 -
fs/xfs/linux-2.6/xfs_trace.h | 1
fs/xfs/xfs_log.c | 13 -
include/linux/blk_types.h | 4
include/linux/blkdev.h | 85 +-------
include/linux/buffer_head.h | 2
include/linux/fs.h | 27 +-
include/scsi/scsi_tcq.h | 6
mm/swapfile.c | 9
59 files changed, 585 insertions(+), 1245 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1022363
[2] http://thread.gmane.org/gmane.linux.raid/29100
[3] http://thread.gmane.org/gmane.linux.file-systems/44957


2010-08-25 15:54:39

by Tejun Heo

Subject: [PATCH 26/30] ext4: do not send discards as barriers

From: Christoph Hellwig <[email protected]>

ext4 already uses synchronous discards; there is no need to add I/O
barriers.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/ext4/mballoc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index df44b34..a22bfef 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2567,7 +2567,7 @@ static inline void ext4_issue_discard(struct super_block *sb,
trace_ext4_discard_blocks(sb,
(unsigned long long) discard_block, count);
ret = sb_issue_discard(sb, discard_block, count, GFP_NOFS,
- BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+ BLKDEV_IFL_WAIT);
if (ret == EOPNOTSUPP) {
ext4_warning(sb, "discard not supported, disabling");
clear_opt(EXT4_SB(sb)->s_mount_opt, DISCARD);
--
1.7.1

2010-08-25 15:54:58

by Tejun Heo

Subject: [PATCH 04/30] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()

Barriers are deemed too heavy and will soon be replaced by FLUSH/FUA
requests. Deprecate barriers: all REQ_HARDBARRIER requests now fail
with -EOPNOTSUPP, and blk_queue_ordered() is replaced with the simpler
blk_queue_flush().

blk_queue_flush() takes a combination of REQ_FLUSH and REQ_FUA. If a
device has a write cache and can flush it, it should set REQ_FLUSH.
If the device can also handle FUA writes, it should set REQ_FUA as
well.

All blk_queue_ordered() users are converted as follows (a usage
sketch appears after the mapping).

* ORDERED_DRAIN is mapped to 0, which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FUA is mapped to REQ_FLUSH | REQ_FUA.
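
As a rough usage sketch (a hypothetical driver probe; the two
capability helpers are assumed for illustration, not real API), a
device with a volatile write cache that also supports FUA would
advertise both flags:

	unsigned int flush = 0;

	if (dev_has_write_cache(dev)) {		/* assumed helper */
		flush |= REQ_FLUSH;
		if (dev_supports_fua(dev))	/* assumed helper */
			flush |= REQ_FUA;
	}
	blk_queue_flush(q, flush);	/* 0 == write-through device */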

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Chris Wright <[email protected]>
Cc: FUJITA Tomonori <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Alasdair G Kergon <[email protected]>
Cc: Pierre Ossman <[email protected]>
Cc: Stefan Weinhuber <[email protected]>
---
block/blk-barrier.c | 29 ----------------------------
block/blk-core.c | 6 +++-
block/blk-settings.c | 20 +++++++++++++++++++
drivers/block/brd.c | 1 -
drivers/block/loop.c | 2 +-
drivers/block/osdblk.c | 2 +-
drivers/block/ps3disk.c | 2 +-
drivers/block/virtio_blk.c | 25 ++++++++---------------
drivers/block/xen-blkfront.c | 43 +++++++++++------------------------------
drivers/ide/ide-disk.c | 13 +++++------
drivers/md/dm.c | 2 +-
drivers/mmc/card/queue.c | 1 -
drivers/s390/block/dasd.c | 1 -
drivers/scsi/sd.c | 16 +++++++-------
include/linux/blkdev.h | 6 +++-
15 files changed, 67 insertions(+), 102 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index c807e9c..ed0aba5 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -9,35 +9,6 @@

#include "blk.h"

-/**
- * blk_queue_ordered - does this queue support ordered writes
- * @q: the request queue
- * @ordered: one of QUEUE_ORDERED_*
- *
- * Description:
- * For journalled file systems, doing ordered writes on a commit
- * block instead of explicitly doing wait_on_buffer (which is bad
- * for performance) can be a big win. Block drivers supporting this
- * feature should call this function and indicate so.
- *
- **/
-int blk_queue_ordered(struct request_queue *q, unsigned ordered)
-{
- if (ordered != QUEUE_ORDERED_NONE &&
- ordered != QUEUE_ORDERED_DRAIN &&
- ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
- ordered != QUEUE_ORDERED_DRAIN_FUA) {
- printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
- return -EINVAL;
- }
-
- q->ordered = ordered;
- q->next_ordered = ordered;
-
- return 0;
-}
-EXPORT_SYMBOL(blk_queue_ordered);
-
/*
* Cache flushing for ordered writes handling
*/
diff --git a/block/blk-core.c b/block/blk-core.c
index ee1a1e7..f063541 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1203,11 +1203,13 @@ static int __make_request(struct request_queue *q, struct bio *bio)
const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
int rw_flags;

- if ((bio->bi_rw & REQ_HARDBARRIER) &&
- (q->next_ordered == QUEUE_ORDERED_NONE)) {
+ /* REQ_HARDBARRIER is no more */
+ if (WARN_ONCE(bio->bi_rw & REQ_HARDBARRIER,
+ "block: HARDBARRIER is deprecated, use FLUSH/FUA instead\n")) {
bio_endio(bio, -EOPNOTSUPP);
return 0;
}
+
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a234f4b..9b18afc 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -794,6 +794,26 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
}
EXPORT_SYMBOL(blk_queue_update_dma_alignment);

+/**
+ * blk_queue_flush - configure queue's cache flush capability
+ * @q: the request queue for the device
+ * @flush: 0, REQ_FLUSH or REQ_FLUSH | REQ_FUA
+ *
+ * Tell block layer cache flush capability of @q. If it supports
+ * flushing, REQ_FLUSH should be set. If it supports bypassing
+ * write cache for individual writes, REQ_FUA should be set.
+ */
+void blk_queue_flush(struct request_queue *q, unsigned int flush)
+{
+ WARN_ON_ONCE(flush & ~(REQ_FLUSH | REQ_FUA));
+
+ if (WARN_ON_ONCE(!(flush & REQ_FLUSH) && (flush & REQ_FUA)))
+ flush &= ~REQ_FUA;
+
+ q->flush_flags = flush & (REQ_FLUSH | REQ_FUA);
+}
+EXPORT_SYMBOL_GPL(blk_queue_flush);
+
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 47a4127..fa33f97 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,6 @@ static struct brd_device *brd_alloc(int i)
if (!brd->brd_queue)
goto out_free_dev;
blk_queue_make_request(brd->brd_queue, brd_make_request);
- blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c3a4a2e..953d1e1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
lo->lo_queue->unplug_fn = loop_unplug;

if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
- blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(lo->lo_queue, REQ_FLUSH);

set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 2284b4f..72d6246 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdblk_device *osdev)
blk_queue_stack_limits(q, osd_request_queue(osdev->osd));

blk_queue_prep_rq(q, blk_queue_start_tag);
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(q, REQ_FLUSH);

disk->queue = q;

diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
index e9da874..4911f9e 100644
--- a/drivers/block/ps3disk.c
+++ b/drivers/block/ps3disk.c
@@ -468,7 +468,7 @@ static int __devinit ps3disk_probe(struct ps3_system_bus_device *_dev)
blk_queue_dma_alignment(queue, dev->blk_size-1);
blk_queue_logical_block_size(queue, dev->blk_size);

- blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(queue, REQ_FLUSH);

blk_queue_max_segments(queue, -1);
blk_queue_max_segment_size(queue, dev->bounce_size);
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 7965280..d10b635 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -388,22 +388,15 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that
- * to implement write barrier support.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
- } else {
- /*
- * If the FLUSH feature is not supported we must assume that
- * the host does not perform any kind of volatile write
- * caching. We still need to drain the queue to provider
- * proper barrier semantics.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_DRAIN);
- }
+ /*
+ * If the FLUSH feature is supported we do have support for
+ * flushing a volatile write cache on the host. Use that to
+ * implement write barrier support; otherwise, we must assume
+ * that the host does not perform any kind of volatile write
+ * caching.
+ */
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+ blk_queue_flush(q, REQ_FLUSH);

/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 8341862..f2ffc46 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -95,7 +95,7 @@ struct blkfront_info
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_RING_SIZE];
unsigned long shadow_free;
- int feature_barrier;
+ unsigned int feature_flush;
int is_ready;
};

@@ -418,25 +418,12 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
}


-static int xlvbd_barrier(struct blkfront_info *info)
+static void xlvbd_flush(struct blkfront_info *info)
{
- int err;
- const char *barrier;
-
- switch (info->feature_barrier) {
- case QUEUE_ORDERED_DRAIN: barrier = "enabled"; break;
- case QUEUE_ORDERED_NONE: barrier = "disabled"; break;
- default: return -EINVAL;
- }
-
- err = blk_queue_ordered(info->rq, info->feature_barrier);
-
- if (err)
- return err;
-
+ blk_queue_flush(info->rq, info->feature_flush);
printk(KERN_INFO "blkfront: %s: barriers %s\n",
- info->gd->disk_name, barrier);
- return 0;
+ info->gd->disk_name,
+ info->feature_flush ? "enabled" : "disabled");
}


@@ -515,7 +502,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
info->rq = gd->queue;
info->gd = gd;

- xlvbd_barrier(info);
+ xlvbd_flush(info);

if (vdisk_info & VDISK_READONLY)
set_disk_ro(gd, 1);
@@ -661,8 +648,8 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
info->gd->disk_name);
error = -EOPNOTSUPP;
- info->feature_barrier = QUEUE_ORDERED_NONE;
- xlvbd_barrier(info);
+ info->feature_flush = 0;
+ xlvbd_flush(info);
}
/* fall through */
case BLKIF_OP_READ:
@@ -1075,19 +1062,13 @@ static void blkfront_connect(struct blkfront_info *info)
/*
* If there's no "feature-barrier" defined, then it means
* we're dealing with a very old backend which writes
- * synchronously; draining will do what needs to get done.
+ * synchronously; nothing to do.
*
* If there are barriers, then we use flush.
- *
- * If barriers are not supported, then there's no much we can
- * do, so just set ordering to NONE.
*/
- if (err)
- info->feature_barrier = QUEUE_ORDERED_DRAIN;
- else if (barrier)
- info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
- else
- info->feature_barrier = QUEUE_ORDERED_NONE;
+ info->feature_flush = 0;
+ if (!err && barrier)
+ info->feature_flush = REQ_FLUSH;

err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
if (err) {
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index 7433e07..7c5b01c 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -516,10 +516,10 @@ static int ide_do_setfeature(ide_drive_t *drive, u8 feature, u8 nsect)
return ide_no_data_taskfile(drive, &cmd);
}

-static void update_ordered(ide_drive_t *drive)
+static void update_flush(ide_drive_t *drive)
{
u16 *id = drive->id;
- unsigned ordered = QUEUE_ORDERED_NONE;
+ unsigned flush = 0;

if (drive->dev_flags & IDE_DFLAG_WCACHE) {
unsigned long long capacity;
@@ -543,13 +543,12 @@ static void update_ordered(ide_drive_t *drive)
drive->name, barrier ? "" : "not ");

if (barrier) {
- ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+ flush = REQ_FLUSH;
blk_queue_prep_rq(drive->queue, idedisk_prep_fn);
}
- } else
- ordered = QUEUE_ORDERED_DRAIN;
+ }

- blk_queue_ordered(drive->queue, ordered);
+ blk_queue_flush(drive->queue, flush);
}

ide_devset_get_flag(wcache, IDE_DFLAG_WCACHE);
@@ -572,7 +571,7 @@ static int set_wcache(ide_drive_t *drive, int arg)
}
}

- update_ordered(drive);
+ update_flush(drive);

return err;
}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ac384b2..b1d92be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2245,7 +2245,7 @@ static int dm_init_request_based_queue(struct mapped_device *md)
blk_queue_softirq_done(md->queue, dm_softirq_done);
blk_queue_prep_rq(md->queue, dm_prep_fn);
blk_queue_lld_busy(md->queue, dm_lld_busy);
- blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH);
+ blk_queue_flush(md->queue, REQ_FLUSH);

elv_register_queue(md->queue);

diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index e876678..9c0b42b 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -128,7 +128,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
mq->req = NULL;

blk_queue_prep_rq(mq->queue, mmc_prep_request);
- blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
if (mmc_can_erase(card)) {
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mq->queue);
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index 8373ca0..9b106d8 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -2197,7 +2197,6 @@ static void dasd_setup_queue(struct dasd_block *block)
*/
blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
- blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN);
}

/*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index cdfc51a..63bd01a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2109,7 +2109,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
struct scsi_disk *sdkp = scsi_disk(disk);
struct scsi_device *sdp = sdkp->device;
unsigned char *buffer;
- unsigned ordered;
+ unsigned flush = 0;

SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
"sd_revalidate_disk\n"));
@@ -2151,15 +2151,15 @@ static int sd_revalidate_disk(struct gendisk *disk)

/*
* We now have all cache related info, determine how we deal
- * with ordered requests.
+ * with flush requests.
*/
- if (sdkp->WCE)
- ordered = sdkp->DPOFUA
- ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
- else
- ordered = QUEUE_ORDERED_DRAIN;
+ if (sdkp->WCE) {
+ flush |= REQ_FLUSH;
+ if (sdkp->DPOFUA)
+ flush |= REQ_FUA;
+ }

- blk_queue_ordered(sdkp->disk->queue, ordered);
+ blk_queue_flush(sdkp->disk->queue, flush);

set_capacity(disk, sdkp->capacity);
kfree(buffer);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7077bc0..e97911d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -355,8 +355,10 @@ struct request_queue
struct blk_trace *blk_trace;
#endif
/*
- * reserved for flush operations
+ * for flush operations
*/
+ unsigned int flush_flags;
+
unsigned int ordered, next_ordered, ordseq;
int orderr, ordcolor;
struct request pre_flush_rq, bar_rq, post_flush_rq;
@@ -865,8 +867,8 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
+extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern int blk_queue_ordered(struct request_queue *, unsigned);
extern bool blk_do_ordered(struct request_queue *, struct request **);
extern unsigned blk_ordered_cur_seq(struct request_queue *);
extern unsigned blk_ordered_req_seq(struct request *);
--
1.7.1

2010-08-25 15:54:41

by Tejun Heo

Subject: [PATCH 16/30] lguest: replace VIRTIO_F_BARRIER support with VIRTIO_F_FLUSH support

VIRTIO_F_BARRIER is deprecated. Replace it with VIRTIO_F_FLUSH
support.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
Documentation/lguest/lguest.c | 29 +++++++++--------------------
1 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index e9ce3c5..fbc64b3 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue *vq)
off = out->sector * 512;

/*
- * The block device implements "barriers", where the Guest indicates
- * that it wants all previous writes to occur before this write. We
- * don't have a way of asking our kernel to do a barrier, so we just
- * synchronize all the data in the file. Pretty poor, no?
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
- /*
* In general the virtio block driver is allowed to try SCSI commands.
* It'd be nice if we supported eject, for example, but we don't.
*/
@@ -1679,6 +1670,13 @@ static void blk_request(struct virtqueue *vq)
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, ret);
}
+
+ wlen = sizeof(*in);
+ *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ } else if (out->type & VIRTIO_BLK_T_FLUSH) {
+ /* Flush */
+ ret = fdatasync(vblk->fd);
+ verbose("FLUSH fdatasync: %i\n", ret);
wlen = sizeof(*in);
*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
} else {
@@ -1702,15 +1700,6 @@ static void blk_request(struct virtqueue *vq)
}
}

- /*
- * OK, so we noted that it was pretty poor to use an fdatasync as a
- * barrier. But Christoph Hellwig points out that we need a sync
- * *afterwards* as well: "Barriers specify no reordering to the front
- * or the back." And Jens Axboe confirmed it, so here we are:
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
/* Finished that request. */
add_used(vq, head, wlen);
}
@@ -1735,8 +1724,8 @@ static void setup_block_file(const char *filename)
vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
vblk->len = lseek64(vblk->fd, 0, SEEK_END);

- /* We support barriers. */
- add_feature(dev, VIRTIO_BLK_F_BARRIER);
+ /* We support FLUSH. */
+ add_feature(dev, VIRTIO_BLK_F_FLUSH);

/* Tell Guest how many sectors this device has. */
conf.capacity = cpu_to_le64(vblk->len / 512);
--
1.7.1

2010-08-25 15:54:25

by Tejun Heo

Subject: [PATCH 19/30] xfs: replace barriers with explicit flush / FUA usage

From: Christoph Hellwig <[email protected]>

Switch to the WRITE_FLUSH_FUA flag for log writes and remove the EOPNOTSUPP
detection for barriers.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/xfs/linux-2.6/xfs_buf.c | 16 ++--------------
fs/xfs/linux-2.6/xfs_buf.h | 11 +----------
fs/xfs/linux-2.6/xfs_trace.h | 1 -
fs/xfs/xfs_log.c | 13 -------------
4 files changed, 3 insertions(+), 38 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index ea79072..b93ea33 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -929,19 +929,7 @@ xfs_buf_iodone_work(
xfs_buf_t *bp =
container_of(work, xfs_buf_t, b_iodone_work);

- /*
- * We can get an EOPNOTSUPP to ordered writes. Here we clear the
- * ordered flag and reissue them. Because we can't tell the higher
- * layers directly that they should not issue ordered I/O anymore, they
- * need to check if the _XFS_BARRIER_FAILED flag was set during I/O completion.
- */
- if ((bp->b_error == EOPNOTSUPP) &&
- (bp->b_flags & (XBF_ORDERED|XBF_ASYNC)) == (XBF_ORDERED|XBF_ASYNC)) {
- trace_xfs_buf_ordered_retry(bp, _RET_IP_);
- bp->b_flags &= ~XBF_ORDERED;
- bp->b_flags |= _XFS_BARRIER_FAILED;
- xfs_buf_iorequest(bp);
- } else if (bp->b_iodone)
+ if (bp->b_iodone)
(*(bp->b_iodone))(bp);
else if (bp->b_flags & XBF_ASYNC)
xfs_buf_relse(bp);
@@ -1200,7 +1188,7 @@ _xfs_buf_ioapply(

if (bp->b_flags & XBF_ORDERED) {
ASSERT(!(bp->b_flags & XBF_READ));
- rw = WRITE_BARRIER;
+ rw = WRITE_FLUSH_FUA;
} else if (bp->b_flags & XBF_LOG_BUFFER) {
ASSERT(!(bp->b_flags & XBF_READ_AHEAD));
bp->b_flags &= ~_XBF_RUN_QUEUES;
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index d072e5f..d533d64 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -86,14 +86,6 @@ typedef enum {
*/
#define _XBF_PAGE_LOCKED (1 << 22)

-/*
- * If we try a barrier write, but it fails we have to communicate
- * this to the upper layers. Unfortunately b_error gets overwritten
- * when the buffer is re-issued so we have to add another flag to
- * keep this information.
- */
-#define _XFS_BARRIER_FAILED (1 << 23)
-
typedef unsigned int xfs_buf_flags_t;

#define XFS_BUF_FLAGS \
@@ -114,8 +106,7 @@ typedef unsigned int xfs_buf_flags_t;
{ _XBF_PAGES, "PAGES" }, \
{ _XBF_RUN_QUEUES, "RUN_QUEUES" }, \
{ _XBF_DELWRI_Q, "DELWRI_Q" }, \
- { _XBF_PAGE_LOCKED, "PAGE_LOCKED" }, \
- { _XFS_BARRIER_FAILED, "BARRIER_FAILED" }
+ { _XBF_PAGE_LOCKED, "PAGE_LOCKED" }


typedef enum {
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index be5dffd..8fe311a 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -325,7 +325,6 @@ DEFINE_BUF_EVENT(xfs_buf_lock);
DEFINE_BUF_EVENT(xfs_buf_lock_done);
DEFINE_BUF_EVENT(xfs_buf_cond_lock);
DEFINE_BUF_EVENT(xfs_buf_unlock);
-DEFINE_BUF_EVENT(xfs_buf_ordered_retry);
DEFINE_BUF_EVENT(xfs_buf_iowait);
DEFINE_BUF_EVENT(xfs_buf_iowait_done);
DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 925d572..430a8fc 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -917,19 +917,6 @@ xlog_iodone(xfs_buf_t *bp)
l = iclog->ic_log;

/*
- * If the _XFS_BARRIER_FAILED flag was set by a lower
- * layer, it means the underlying device no longer supports
- * barrier I/O. Warn loudly and turn off barriers.
- */
- if (bp->b_flags & _XFS_BARRIER_FAILED) {
- bp->b_flags &= ~_XFS_BARRIER_FAILED;
- l->l_mp->m_flags &= ~XFS_MOUNT_BARRIER;
- xfs_fs_cmn_err(CE_WARN, l->l_mp,
- "xlog_iodone: Barriers are no longer supported"
- " by device. Disabling barriers\n");
- }
-
- /*
* Race to shutdown the filesystem if we see an error.
*/
if (XFS_TEST_ERROR((XFS_BUF_GETERROR(bp)), l->l_mp,
--
1.7.1

2010-08-25 15:54:34

by Tejun Heo

Subject: [PATCH 22/30] reiserfs: replace barriers with explicit flush / FUA usage

From: Christoph Hellwig <[email protected]>

Switch to the WRITE_FLUSH_FUA flag for log writes and remove the
EOPNOTSUPP detection for barriers. Note that reiserfs previously had a
fairly different code path for barriers, as it was the only filesystem
actually making use of them. The new code always uses the old
non-barrier code path and just sets WRITE_FLUSH_FUA explicitly for the
journal commits.
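
Condensed from the hunks below into a minimal sketch, the resulting
pattern writes the commit block through the ordinary dirty-buffer path
and asks for flush/FUA semantics only when the barrier mount option is
active:

	mark_buffer_dirty(bh);
	if (reiserfs_barrier_flush(s))
		/* flush the cache, then FUA-write the commit block */
		__sync_dirty_buffer(bh, WRITE_FLUSH_FUA);
	else
		sync_dirty_buffer(bh);	/* plain synchronous write */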

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Jan Kara <[email protected]>
Acked-by: Chris Mason <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/reiserfs/journal.c | 106 +++++++++---------------------------------------
1 files changed, 20 insertions(+), 86 deletions(-)

diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 812e2c0..076c8b1 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -138,13 +138,6 @@ static int reiserfs_clean_and_file_buffer(struct buffer_head *bh)
return 0;
}

-static void disable_barrier(struct super_block *s)
-{
- REISERFS_SB(s)->s_mount_opt &= ~(1 << REISERFS_BARRIER_FLUSH);
- printk("reiserfs: disabling flush barriers on %s\n",
- reiserfs_bdevname(s));
-}
-
static struct reiserfs_bitmap_node *allocate_bitmap_node(struct super_block
*sb)
{
@@ -677,30 +670,6 @@ static void submit_ordered_buffer(struct buffer_head *bh)
submit_bh(WRITE, bh);
}

-static int submit_barrier_buffer(struct buffer_head *bh)
-{
- get_bh(bh);
- bh->b_end_io = reiserfs_end_ordered_io;
- clear_buffer_dirty(bh);
- if (!buffer_uptodate(bh))
- BUG();
- return submit_bh(WRITE_BARRIER, bh);
-}
-
-static void check_barrier_completion(struct super_block *s,
- struct buffer_head *bh)
-{
- if (buffer_eopnotsupp(bh)) {
- clear_buffer_eopnotsupp(bh);
- disable_barrier(s);
- set_buffer_uptodate(bh);
- set_buffer_dirty(bh);
- reiserfs_write_unlock(s);
- sync_dirty_buffer(bh);
- reiserfs_write_lock(s);
- }
-}
-
#define CHUNK_SIZE 32
struct buffer_chunk {
struct buffer_head *bh[CHUNK_SIZE];
@@ -1009,7 +978,6 @@ static int flush_commit_list(struct super_block *s,
struct buffer_head *tbh = NULL;
unsigned int trans_id = jl->j_trans_id;
struct reiserfs_journal *journal = SB_JOURNAL(s);
- int barrier = 0;
int retval = 0;
int write_len;

@@ -1094,24 +1062,6 @@ static int flush_commit_list(struct super_block *s,
}
atomic_dec(&journal->j_async_throttle);

- /* We're skipping the commit if there's an error */
- if (retval || reiserfs_is_journal_aborted(journal))
- barrier = 0;
-
- /* wait on everything written so far before writing the commit
- * if we are in barrier mode, send the commit down now
- */
- barrier = reiserfs_barrier_flush(s);
- if (barrier) {
- int ret;
- lock_buffer(jl->j_commit_bh);
- ret = submit_barrier_buffer(jl->j_commit_bh);
- if (ret == -EOPNOTSUPP) {
- set_buffer_uptodate(jl->j_commit_bh);
- disable_barrier(s);
- barrier = 0;
- }
- }
for (i = 0; i < (jl->j_len + 1); i++) {
bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) +
(jl->j_start + i) % SB_ONDISK_JOURNAL_SIZE(s);
@@ -1143,27 +1093,22 @@ static int flush_commit_list(struct super_block *s,

BUG_ON(atomic_read(&(jl->j_commit_left)) != 1);

- if (!barrier) {
- /* If there was a write error in the journal - we can't commit
- * this transaction - it will be invalid and, if successful,
- * will just end up propagating the write error out to
- * the file system. */
- if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
- if (buffer_dirty(jl->j_commit_bh))
- BUG();
- mark_buffer_dirty(jl->j_commit_bh) ;
- reiserfs_write_unlock(s);
- sync_dirty_buffer(jl->j_commit_bh) ;
- reiserfs_write_lock(s);
- }
- } else {
+ /* If there was a write error in the journal - we can't commit
+ * this transaction - it will be invalid and, if successful,
+ * will just end up propagating the write error out to
+ * the file system. */
+ if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
+ if (buffer_dirty(jl->j_commit_bh))
+ BUG();
+ mark_buffer_dirty(jl->j_commit_bh) ;
reiserfs_write_unlock(s);
- wait_on_buffer(jl->j_commit_bh);
+ if (reiserfs_barrier_flush(s))
+ __sync_dirty_buffer(jl->j_commit_bh, WRITE_FLUSH_FUA);
+ else
+ sync_dirty_buffer(jl->j_commit_bh);
reiserfs_write_lock(s);
}

- check_barrier_completion(s, jl->j_commit_bh);
-
/* If there was a write error in the journal - we can't commit this
* transaction - it will be invalid and, if successful, will just end
* up propagating the write error out to the filesystem. */
@@ -1319,26 +1264,15 @@ static int _update_journal_header_block(struct super_block *sb,
jh->j_first_unflushed_offset = cpu_to_le32(offset);
jh->j_mount_id = cpu_to_le32(journal->j_mount_id);

- if (reiserfs_barrier_flush(sb)) {
- int ret;
- lock_buffer(journal->j_header_bh);
- ret = submit_barrier_buffer(journal->j_header_bh);
- if (ret == -EOPNOTSUPP) {
- set_buffer_uptodate(journal->j_header_bh);
- disable_barrier(sb);
- goto sync;
- }
- reiserfs_write_unlock(sb);
- wait_on_buffer(journal->j_header_bh);
- reiserfs_write_lock(sb);
- check_barrier_completion(sb, journal->j_header_bh);
- } else {
- sync:
- set_buffer_dirty(journal->j_header_bh);
- reiserfs_write_unlock(sb);
+ set_buffer_dirty(journal->j_header_bh);
+ reiserfs_write_unlock(sb);
+
+ if (reiserfs_barrier_flush(sb))
+ __sync_dirty_buffer(journal->j_header_bh, WRITE_FLUSH_FUA);
+ else
sync_dirty_buffer(journal->j_header_bh);
- reiserfs_write_lock(sb);
- }
+
+ reiserfs_write_lock(sb);
if (!buffer_uptodate(journal->j_header_bh)) {
reiserfs_warning(sb, "journal-837",
"IO error during journal replay");
--
1.7.1

2010-08-25 15:54:30

by Tejun Heo

Subject: [PATCH 18/30] block: pass gfp_mask and flags to sb_issue_discard

From: Christoph Hellwig <[email protected]>

We'll need to get rid of the BLKDEV_IFL_BARRIER flag. To facilitate
that, and to make the interface less confusing, pass all flags to
sb_issue_discard() explicitly.
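
With the explicit arguments, each caller now states its own allocation
context and wait/barrier behavior. A hypothetical caller (variable
names assumed) might look like the sketch below; later patches in the
series then drop BLKDEV_IFL_BARRIER from the real call sites:

	/* GFP_NOFS because the caller may hold filesystem locks;
	 * BLKDEV_IFL_WAIT to block until the discard completes */
	ret = sb_issue_discard(sb, start_block, nr_blocks,
			       GFP_NOFS, BLKDEV_IFL_WAIT);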

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/ext4/mballoc.c | 3 ++-
fs/fat/fatent.c | 4 +++-
include/linux/blkdev.h | 11 +++++------
3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4b4ad4b..df44b34 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2566,7 +2566,8 @@ static inline void ext4_issue_discard(struct super_block *sb,
discard_block = block + ext4_group_first_block_no(sb, block_group);
trace_ext4_discard_blocks(sb,
(unsigned long long) discard_block, count);
- ret = sb_issue_discard(sb, discard_block, count);
+ ret = sb_issue_discard(sb, discard_block, count, GFP_NOFS,
+ BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
if (ret == EOPNOTSUPP) {
ext4_warning(sb, "discard not supported, disabling");
clear_opt(EXT4_SB(sb)->s_mount_opt, DISCARD);
diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
index 81184d3..3a56a82 100644
--- a/fs/fat/fatent.c
+++ b/fs/fat/fatent.c
@@ -577,7 +577,9 @@ int fat_free_clusters(struct inode *inode, int cluster)

sb_issue_discard(sb,
fat_clus_to_blknr(sbi, first_cl),
- nr_clus * sbi->sec_per_clus);
+ nr_clus * sbi->sec_per_clus,
+ GFP_NOFS,
+ BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);

first_cl = cluster;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8ef705f..6b305eb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -881,13 +881,12 @@ extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
-static inline int sb_issue_discard(struct super_block *sb,
- sector_t block, sector_t nr_blocks)
+static inline int sb_issue_discard(struct super_block *sb, sector_t block,
+ sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
{
- block <<= (sb->s_blocksize_bits - 9);
- nr_blocks <<= (sb->s_blocksize_bits - 9);
- return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_NOFS,
- BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+ return blkdev_issue_discard(sb->s_bdev, block << (sb->s_blocksize_bits - 9),
+ nr_blocks << (sb->s_blocksize_bits - 9),
+ gfp_mask, flags);
}

extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
--
1.7.1

2010-08-25 15:55:32

by Tejun Heo

Subject: [PATCH 05/30] block: remove spurious uses of REQ_HARDBARRIER

REQ_HARDBARRIER is deprecated. Remove its spurious uses from the
following users. Note that, except for osdblk, all of these uses were
already spurious before the deprecation.

* osdblk: osdblk_rq_fn() won't receive any request with
REQ_HARDBARRIER set. Remove the test for it.

* pktcdvd: use of REQ_HARDBARRIER in pkt_generic_packet() doesn't mean
anything. Removed.

* aic7xxx_old: Setting MSG_ORDERED_Q_TAG on REQ_HARDBARRIER is
spurious. Removed.

* sas_scsi_host: Setting TASK_ATTR_ORDERED on REQ_HARDBARRIER is
spurious. Removed.

* scsi_tcq: The ordered tag path wasn't being used anyway. Removed.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Peter Osterlund <[email protected]>
---
drivers/block/osdblk.c | 3 +--
drivers/block/pktcdvd.c | 1 -
drivers/scsi/aic7xxx_old.c | 21 ++-------------------
drivers/scsi/libsas/sas_scsi_host.c | 13 +------------
include/scsi/scsi_tcq.h | 6 +-----
5 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 72d6246..87311eb 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -310,8 +310,7 @@ static void osdblk_rq_fn(struct request_queue *q)
break;

/* filter out block requests we don't understand */
- if (rq->cmd_type != REQ_TYPE_FS &&
- !(rq->cmd_flags & REQ_HARDBARRIER)) {
+ if (rq->cmd_type != REQ_TYPE_FS) {
blk_end_request_all(rq, 0);
continue;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index b1cbeb5..0166ea1 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -753,7 +753,6 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *

rq->timeout = 60*HZ;
rq->cmd_type = REQ_TYPE_BLOCK_PC;
- rq->cmd_flags |= REQ_HARDBARRIER;
if (cgc->quiet)
rq->cmd_flags |= REQ_QUIET;

diff --git a/drivers/scsi/aic7xxx_old.c b/drivers/scsi/aic7xxx_old.c
index 93984c9..e1cd606 100644
--- a/drivers/scsi/aic7xxx_old.c
+++ b/drivers/scsi/aic7xxx_old.c
@@ -2850,12 +2850,6 @@ aic7xxx_done(struct aic7xxx_host *p, struct aic7xxx_scb *scb)
aic_dev->r_total++;
ptr = aic_dev->r_bins;
}
- if(cmd->device->simple_tags && cmd->request->cmd_flags & REQ_HARDBARRIER)
- {
- aic_dev->barrier_total++;
- if(scb->tag_action == MSG_ORDERED_Q_TAG)
- aic_dev->ordered_total++;
- }
x = scb->sg_length;
x >>= 10;
for(i=0; i<6; i++)
@@ -10144,19 +10138,8 @@ static void aic7xxx_buildscb(struct aic7xxx_host *p, struct scsi_cmnd *cmd,
/* We always force TEST_UNIT_READY to untagged */
if (cmd->cmnd[0] != TEST_UNIT_READY && sdptr->simple_tags)
{
- if (req->cmd_flags & REQ_HARDBARRIER)
- {
- if(sdptr->ordered_tags)
- {
- hscb->control |= MSG_ORDERED_Q_TAG;
- scb->tag_action = MSG_ORDERED_Q_TAG;
- }
- }
- else
- {
- hscb->control |= MSG_SIMPLE_Q_TAG;
- scb->tag_action = MSG_SIMPLE_Q_TAG;
- }
+ hscb->control |= MSG_SIMPLE_Q_TAG;
+ scb->tag_action = MSG_SIMPLE_Q_TAG;
}
}
if ( !(aic_dev->dtr_pending) &&
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f0cfba9..535085c 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -130,17 +130,6 @@ static void sas_scsi_task_done(struct sas_task *task)
sc->scsi_done(sc);
}

-static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
-{
- enum task_attribute ta = TASK_ATTR_SIMPLE;
- if (cmd->request && blk_rq_tagged(cmd->request)) {
- if (cmd->device->ordered_tags &&
- (cmd->request->cmd_flags & REQ_HARDBARRIER))
- ta = TASK_ATTR_ORDERED;
- }
- return ta;
-}
-
static struct sas_task *sas_create_task(struct scsi_cmnd *cmd,
struct domain_device *dev,
gfp_t gfp_flags)
@@ -160,7 +149,7 @@ static struct sas_task *sas_create_task(struct scsi_cmnd *cmd,
task->ssp_task.retry_count = 1;
int_to_scsilun(cmd->device->lun, &lun);
memcpy(task->ssp_task.LUN, &lun.scsi_lun, 8);
- task->ssp_task.task_attr = sas_scsi_get_task_attr(cmd);
+ task->ssp_task.task_attr = TASK_ATTR_SIMPLE;
memcpy(task->ssp_task.cdb, cmd->cmnd, 16);

task->scatter = scsi_sglist(cmd);
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index 1723138..d6e7994 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -97,13 +97,9 @@ static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
{
struct request *req = cmd->request;
- struct scsi_device *sdev = cmd->device;

if (blk_rq_tagged(req)) {
- if (sdev->ordered_tags && req->cmd_flags & REQ_HARDBARRIER)
- *msg++ = MSG_ORDERED_TAG;
- else
- *msg++ = MSG_SIMPLE_TAG;
+ *msg++ = MSG_SIMPLE_TAG;
*msg++ = req->tag;
return 2;
}
--
1.7.1

2010-08-25 15:54:13

by Tejun Heo

Subject: [PATCH 28/30] swap: do not send discards as barriers

From: Christoph Hellwig <[email protected]>

The swap code already uses synchronous discards; there is no need to
add I/O barriers.

tj: superfluous newlines removed.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Tested-by: Nigel Cunningham <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
mm/swapfile.c | 9 +++------
1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1f3f9c5..68cda16 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -141,8 +141,7 @@ static int discard_swap(struct swap_info_struct *si)
nr_blocks = ((sector_t)se->nr_pages - 1) << (PAGE_SHIFT - 9);
if (nr_blocks) {
err = blkdev_issue_discard(si->bdev, start_block,
- nr_blocks, GFP_KERNEL,
- BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+ nr_blocks, GFP_KERNEL, BLKDEV_IFL_WAIT);
if (err)
return err;
cond_resched();
@@ -153,8 +152,7 @@ static int discard_swap(struct swap_info_struct *si)
nr_blocks = (sector_t)se->nr_pages << (PAGE_SHIFT - 9);

err = blkdev_issue_discard(si->bdev, start_block,
- nr_blocks, GFP_KERNEL,
- BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+ nr_blocks, GFP_KERNEL, BLKDEV_IFL_WAIT);
if (err)
break;

@@ -193,8 +191,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
start_block <<= PAGE_SHIFT - 9;
nr_blocks <<= PAGE_SHIFT - 9;
if (blkdev_issue_discard(si->bdev, start_block,
- nr_blocks, GFP_NOIO, BLKDEV_IFL_WAIT |
- BLKDEV_IFL_BARRIER))
+ nr_blocks, GFP_NOIO, BLKDEV_IFL_WAIT))
break;
}

--
1.7.1

2010-08-25 15:54:18

by Tejun Heo

Subject: [PATCH 23/30] nilfs2: replace barriers with explicit flush / FUA usage

From: Christoph Hellwig <[email protected]>

Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.

tj: nilfs is now fixed to wait for discard completion. Updated this
patch accordingly and dropped the warning about it.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/nilfs2/super.c | 10 +---------
fs/nilfs2/the_nilfs.c | 7 ++-----
2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 9222633..faa5078 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -178,17 +178,9 @@ static int nilfs_sync_super(struct nilfs_sb_info *sbi, int flag)

retry:
set_buffer_dirty(nilfs->ns_sbh[0]);
-
if (nilfs_test_opt(sbi, BARRIER)) {
err = __sync_dirty_buffer(nilfs->ns_sbh[0],
- WRITE_SYNC | WRITE_BARRIER);
- if (err == -EOPNOTSUPP) {
- nilfs_warning(sbi->s_super, __func__,
- "barrier-based sync failed. "
- "disabling barriers\n");
- nilfs_clear_opt(sbi, BARRIER);
- goto retry;
- }
+ WRITE_SYNC | WRITE_FLUSH_FUA);
} else {
err = sync_dirty_buffer(nilfs->ns_sbh[0]);
}
diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index 4317f17..400b2ca 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -774,9 +774,7 @@ int nilfs_discard_segments(struct the_nilfs *nilfs, __u64 *segnump,
ret = blkdev_issue_discard(nilfs->ns_bdev,
start * sects_per_block,
nblocks * sects_per_block,
- GFP_NOFS,
- BLKDEV_IFL_WAIT |
- BLKDEV_IFL_BARRIER);
+ GFP_NOFS, BLKDEV_IFL_WAIT);
if (ret < 0)
return ret;
nblocks = 0;
@@ -786,8 +784,7 @@ int nilfs_discard_segments(struct the_nilfs *nilfs, __u64 *segnump,
ret = blkdev_issue_discard(nilfs->ns_bdev,
start * sects_per_block,
nblocks * sects_per_block,
- GFP_NOFS,
- BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+ GFP_NOFS, BLKDEV_IFL_WAIT);
return ret;
}

--
1.7.1

2010-08-25 15:56:42

by Tejun Heo

Subject: [PATCH 03/30] block: kill QUEUE_ORDERED_BY_TAG

Nobody is making meaningful use of ORDERED_BY_TAG now, and queue
draining for barrier requests will be removed soon, which will render
the advantage of tag ordering moot. Kill ORDERED_BY_TAG. The
following users are affected.

* brd: converted to ORDERED_DRAIN.
* virtio_blk: ORDERED_TAG path was already marked deprecated. Removed.
* xen-blkfront: ORDERED_TAG case dropped.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Chris Wright <[email protected]>
---
block/blk-barrier.c | 35 +++++++----------------------------
drivers/block/brd.c | 2 +-
drivers/block/virtio_blk.c | 9 ---------
drivers/block/xen-blkfront.c | 8 +++-----
drivers/scsi/sd.c | 4 +---
include/linux/blkdev.h | 17 +----------------
6 files changed, 13 insertions(+), 62 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f0faefc..c807e9c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -26,10 +26,7 @@ int blk_queue_ordered(struct request_queue *q, unsigned ordered)
if (ordered != QUEUE_ORDERED_NONE &&
ordered != QUEUE_ORDERED_DRAIN &&
ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
- ordered != QUEUE_ORDERED_DRAIN_FUA &&
- ordered != QUEUE_ORDERED_TAG &&
- ordered != QUEUE_ORDERED_TAG_FLUSH &&
- ordered != QUEUE_ORDERED_TAG_FUA) {
+ ordered != QUEUE_ORDERED_DRAIN_FUA) {
printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
return -EINVAL;
}
@@ -155,21 +152,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
* For an empty barrier, there's no actual BAR request, which
* in turn makes POSTFLUSH unnecessary. Mask them off.
*/
- if (!blk_rq_sectors(rq)) {
+ if (!blk_rq_sectors(rq))
q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
QUEUE_ORDERED_DO_POSTFLUSH);
- /*
- * Empty barrier on a write-through device w/ ordered
- * tag has no command to issue and without any command
- * to issue, ordering by tag can't be used. Drain
- * instead.
- */
- if ((q->ordered & QUEUE_ORDERED_BY_TAG) &&
- !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) {
- q->ordered &= ~QUEUE_ORDERED_BY_TAG;
- q->ordered |= QUEUE_ORDERED_BY_DRAIN;
- }
- }

/* stash away the original request */
blk_dequeue_request(rq);
@@ -210,7 +195,7 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
} else
skip |= QUEUE_ORDSEQ_PREFLUSH;

- if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
+ if (queue_in_flight(q))
rq = NULL;
else
skip |= QUEUE_ORDSEQ_DRAIN;
@@ -257,16 +242,10 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
return true;

- if (q->ordered & QUEUE_ORDERED_BY_TAG) {
- /* Ordered by tag. Blocking the next barrier is enough. */
- if (is_barrier && rq != &q->bar_rq)
- *rqp = NULL;
- } else {
- /* Ordered by draining. Wait for turn. */
- WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
- if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
- *rqp = NULL;
- }
+ /* Ordered by draining. Wait for turn. */
+ WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
+ if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
+ *rqp = NULL;

return true;
}
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 1c7f637..47a4127 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,7 @@ static struct brd_device *brd_alloc(int i)
if (!brd->brd_queue)
goto out_free_dev;
blk_queue_make_request(brd->brd_queue, brd_make_request);
- blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG);
+ blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2aafafc..7965280 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -395,15 +395,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
* to implement write barrier support.
*/
blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
- } else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) {
- /*
- * If the BARRIER feature is supported the host expects us
- * to order request by tags. This implies there is not
- * volatile write cache on the host, and that the host
- * never re-orders outstanding I/O. This feature is not
- * useful for real life scenarious and deprecated.
- */
- blk_queue_ordered(q, QUEUE_ORDERED_TAG);
} else {
/*
* If the FLUSH feature is not supported we must assume that
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ab735a6..8341862 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -424,8 +424,7 @@ static int xlvbd_barrier(struct blkfront_info *info)
const char *barrier;

switch (info->feature_barrier) {
- case QUEUE_ORDERED_DRAIN: barrier = "enabled (drain)"; break;
- case QUEUE_ORDERED_TAG: barrier = "enabled (tag)"; break;
+ case QUEUE_ORDERED_DRAIN: barrier = "enabled"; break;
case QUEUE_ORDERED_NONE: barrier = "disabled"; break;
default: return -EINVAL;
}
@@ -1078,8 +1077,7 @@ static void blkfront_connect(struct blkfront_info *info)
* we're dealing with a very old backend which writes
* synchronously; draining will do what needs to get done.
*
- * If there are barriers, then we can do full queued writes
- * with tagged barriers.
+ * If there are barriers, then we use flush.
*
* If barriers are not supported, then there's no much we can
* do, so just set ordering to NONE.
@@ -1087,7 +1085,7 @@ static void blkfront_connect(struct blkfront_info *info)
if (err)
info->feature_barrier = QUEUE_ORDERED_DRAIN;
else if (barrier)
- info->feature_barrier = QUEUE_ORDERED_TAG;
+ info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
else
info->feature_barrier = QUEUE_ORDERED_NONE;

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 2714bec..cdfc51a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2151,9 +2151,7 @@ static int sd_revalidate_disk(struct gendisk *disk)

/*
* We now have all cache related info, determine how we deal
- * with ordered requests. Note that as the current SCSI
- * dispatch function can alter request order, we cannot use
- * QUEUE_ORDERED_TAG_* even when ordered tag is supported.
+ * with ordered requests.
*/
if (sdkp->WCE)
ordered = sdkp->DPOFUA
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 015375c..7077bc0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -470,12 +470,7 @@ enum {
* DRAIN : ordering by draining is enough
* DRAIN_FLUSH : ordering by draining w/ pre and post flushes
* DRAIN_FUA : ordering by draining w/ pre flush and FUA write
- * TAG : ordering by tag is enough
- * TAG_FLUSH : ordering by tag w/ pre and post flushes
- * TAG_FUA : ordering by tag w/ pre flush and FUA write
*/
- QUEUE_ORDERED_BY_DRAIN = 0x01,
- QUEUE_ORDERED_BY_TAG = 0x02,
QUEUE_ORDERED_DO_PREFLUSH = 0x10,
QUEUE_ORDERED_DO_BAR = 0x20,
QUEUE_ORDERED_DO_POSTFLUSH = 0x40,
@@ -483,8 +478,7 @@ enum {

QUEUE_ORDERED_NONE = 0x00,

- QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_BY_DRAIN |
- QUEUE_ORDERED_DO_BAR,
+ QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_DO_BAR,
QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN |
QUEUE_ORDERED_DO_PREFLUSH |
QUEUE_ORDERED_DO_POSTFLUSH,
@@ -492,15 +486,6 @@ enum {
QUEUE_ORDERED_DO_PREFLUSH |
QUEUE_ORDERED_DO_FUA,

- QUEUE_ORDERED_TAG = QUEUE_ORDERED_BY_TAG |
- QUEUE_ORDERED_DO_BAR,
- QUEUE_ORDERED_TAG_FLUSH = QUEUE_ORDERED_TAG |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_POSTFLUSH,
- QUEUE_ORDERED_TAG_FUA = QUEUE_ORDERED_TAG |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_FUA,
-
/*
* Ordered operation sequence
*/
--
1.7.1

2010-08-25 15:57:11

by Tejun Heo

Subject: [PATCH 17/30] md: implement REQ_FLUSH/FUA support

This patch converts md to support REQ_FLUSH/FUA instead of the now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and their users don't need retry
logic; the retry logic is removed.

* A preflush needs to be issued to all member devices, but FUA writes
can be handled the same way as other writes - their processing can be
deferred to the request_queues of the member devices.
md_barrier_request() is renamed to md_flush_request() and simplified
accordingly.

For linear, raid0 and multipath, the core changes are enough; raid1,
raid5 and raid10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.

* raid5: Queue draining logic dropped. The FUA bit is propagated
through biodrain and stripe reconstruction such that all the updated
parts of the stripe are written out with FUA writes if any of the
dirtying writes was FUA. preread_active_stripes handling in
make_request() is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, raid1, raid5 and raid10 have been tested.
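
Condensed into a minimal sketch (cf. the linear_make_request() hunk
below), the division of labor is that a personality hands REQ_FLUSH
bios to the md core, which preflushes every member device and then
re-dispatches the data portion with the flag cleared:

	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
		md_flush_request(mddev, bio);	/* core preflushes members */
		return 0;
	}
	/* REQ_FUA needs no special handling here; the bit is simply
	 * passed down to the member devices' request_queues. */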

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Neil Brown <[email protected]>
---
drivers/md/linear.c | 4 +-
drivers/md/md.c | 117 +++++++-------------------------
drivers/md/md.h | 23 ++-----
drivers/md/multipath.c | 4 +-
drivers/md/raid0.c | 4 +-
drivers/md/raid1.c | 175 ++++++++++++++++--------------------------------
drivers/md/raid1.h | 2 -
drivers/md/raid10.c | 7 +-
drivers/md/raid5.c | 43 ++++++------
drivers/md/raid5.h | 1 +
10 files changed, 122 insertions(+), 258 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index ba19060..8a2f767 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -294,8 +294,8 @@ static int linear_make_request (mddev_t *mddev, struct bio *bio)
dev_info_t *tmp_dev;
sector_t start_sector;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c148b63..3640f02 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -226,12 +226,12 @@ static int md_make_request(struct request_queue *q, struct bio *bio)
return 0;
}
rcu_read_lock();
- if (mddev->suspended || mddev->barrier) {
+ if (mddev->suspended) {
DEFINE_WAIT(__wait);
for (;;) {
prepare_to_wait(&mddev->sb_wait, &__wait,
TASK_UNINTERRUPTIBLE);
- if (!mddev->suspended && !mddev->barrier)
+ if (!mddev->suspended)
break;
rcu_read_unlock();
schedule();
@@ -282,40 +282,29 @@ EXPORT_SYMBOL_GPL(mddev_resume);

int mddev_congested(mddev_t *mddev, int bits)
{
- if (mddev->barrier)
- return 1;
return mddev->suspended;
}
EXPORT_SYMBOL(mddev_congested);

/*
- * Generic barrier handling for md
+ * Generic flush handling for md
*/

-#define POST_REQUEST_BARRIER ((void*)1)
-
-static void md_end_barrier(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err)
{
mdk_rdev_t *rdev = bio->bi_private;
mddev_t *mddev = rdev->mddev;
- if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
- set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);

rdev_dec_pending(rdev, mddev);

if (atomic_dec_and_test(&mddev->flush_pending)) {
- if (mddev->barrier == POST_REQUEST_BARRIER) {
- /* This was a post-request barrier */
- mddev->barrier = NULL;
- wake_up(&mddev->sb_wait);
- } else
- /* The pre-request barrier has finished */
- schedule_work(&mddev->barrier_work);
+ /* The pre-request flush has finished */
+ schedule_work(&mddev->flush_work);
}
bio_put(bio);
}

-static void submit_barriers(mddev_t *mddev)
+static void submit_flushes(mddev_t *mddev)
{
mdk_rdev_t *rdev;

@@ -332,60 +321,56 @@ static void submit_barriers(mddev_t *mddev)
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
bi = bio_alloc(GFP_KERNEL, 0);
- bi->bi_end_io = md_end_barrier;
+ bi->bi_end_io = md_end_flush;
bi->bi_private = rdev;
bi->bi_bdev = rdev->bdev;
atomic_inc(&mddev->flush_pending);
- submit_bio(WRITE_BARRIER, bi);
+ submit_bio(WRITE_FLUSH, bi);
rcu_read_lock();
rdev_dec_pending(rdev, mddev);
}
rcu_read_unlock();
}

-static void md_submit_barrier(struct work_struct *ws)
+static void md_submit_flush_data(struct work_struct *ws)
{
- mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
- struct bio *bio = mddev->barrier;
+ mddev_t *mddev = container_of(ws, mddev_t, flush_work);
+ struct bio *bio = mddev->flush_bio;

atomic_set(&mddev->flush_pending, 1);

- if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
- bio_endio(bio, -EOPNOTSUPP);
- else if (bio->bi_size == 0)
+ if (bio->bi_size == 0)
/* an empty barrier - all done */
bio_endio(bio, 0);
else {
- bio->bi_rw &= ~REQ_HARDBARRIER;
+ bio->bi_rw &= ~REQ_FLUSH;
if (mddev->pers->make_request(mddev, bio))
generic_make_request(bio);
- mddev->barrier = POST_REQUEST_BARRIER;
- submit_barriers(mddev);
}
if (atomic_dec_and_test(&mddev->flush_pending)) {
- mddev->barrier = NULL;
+ mddev->flush_bio = NULL;
wake_up(&mddev->sb_wait);
}
}

-void md_barrier_request(mddev_t *mddev, struct bio *bio)
+void md_flush_request(mddev_t *mddev, struct bio *bio)
{
spin_lock_irq(&mddev->write_lock);
wait_event_lock_irq(mddev->sb_wait,
- !mddev->barrier,
+ !mddev->flush_bio,
mddev->write_lock, /*nothing*/);
- mddev->barrier = bio;
+ mddev->flush_bio = bio;
spin_unlock_irq(&mddev->write_lock);

atomic_set(&mddev->flush_pending, 1);
- INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+ INIT_WORK(&mddev->flush_work, md_submit_flush_data);

- submit_barriers(mddev);
+ submit_flushes(mddev);

if (atomic_dec_and_test(&mddev->flush_pending))
- schedule_work(&mddev->barrier_work);
+ schedule_work(&mddev->flush_work);
}
-EXPORT_SYMBOL(md_barrier_request);
+EXPORT_SYMBOL(md_flush_request);

/* Support for plugging.
* This mirrors the plugging support in request_queue, but does not
@@ -696,31 +681,6 @@ static void super_written(struct bio *bio, int error)
bio_put(bio);
}

-static void super_written_barrier(struct bio *bio, int error)
-{
- struct bio *bio2 = bio->bi_private;
- mdk_rdev_t *rdev = bio2->bi_private;
- mddev_t *mddev = rdev->mddev;
-
- if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
- error == -EOPNOTSUPP) {
- unsigned long flags;
- /* barriers don't appear to be supported :-( */
- set_bit(BarriersNotsupp, &rdev->flags);
- mddev->barriers_work = 0;
- spin_lock_irqsave(&mddev->write_lock, flags);
- bio2->bi_next = mddev->biolist;
- mddev->biolist = bio2;
- spin_unlock_irqrestore(&mddev->write_lock, flags);
- wake_up(&mddev->sb_wait);
- bio_put(bio);
- } else {
- bio_put(bio2);
- bio->bi_private = rdev;
- super_written(bio, error);
- }
-}
-
void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page)
{
@@ -729,51 +689,28 @@ void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
* and decrement it on completion, waking up sb_wait
* if zero is reached.
* If an error occurred, call md_error
- *
- * As we might need to resubmit the request if REQ_HARDBARRIER
- * causes ENOTSUPP, we allocate a spare bio...
*/
struct bio *bio = bio_alloc(GFP_NOIO, 1);
- int rw = REQ_WRITE | REQ_SYNC | REQ_UNPLUG;

bio->bi_bdev = rdev->bdev;
bio->bi_sector = sector;
bio_add_page(bio, page, size, 0);
bio->bi_private = rdev;
bio->bi_end_io = super_written;
- bio->bi_rw = rw;

atomic_inc(&mddev->pending_writes);
- if (!test_bit(BarriersNotsupp, &rdev->flags)) {
- struct bio *rbio;
- rw |= REQ_HARDBARRIER;
- rbio = bio_clone(bio, GFP_NOIO);
- rbio->bi_private = bio;
- rbio->bi_end_io = super_written_barrier;
- submit_bio(rw, rbio);
- } else
- submit_bio(rw, bio);
+ submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
+ bio);
}

void md_super_wait(mddev_t *mddev)
{
- /* wait for all superblock writes that were scheduled to complete.
- * if any had to be retried (due to BARRIER problems), retry them
- */
+ /* wait for all superblock writes that were scheduled to complete */
DEFINE_WAIT(wq);
for(;;) {
prepare_to_wait(&mddev->sb_wait, &wq, TASK_UNINTERRUPTIBLE);
if (atomic_read(&mddev->pending_writes)==0)
break;
- while (mddev->biolist) {
- struct bio *bio;
- spin_lock_irq(&mddev->write_lock);
- bio = mddev->biolist;
- mddev->biolist = bio->bi_next ;
- bio->bi_next = NULL;
- spin_unlock_irq(&mddev->write_lock);
- submit_bio(bio->bi_rw, bio);
- }
schedule();
}
finish_wait(&mddev->sb_wait, &wq);
@@ -1070,7 +1007,6 @@ static int super_90_validate(mddev_t *mddev, mdk_rdev_t *rdev)
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 0;
@@ -1485,7 +1421,6 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 1;
@@ -4506,7 +4441,6 @@ int md_run(mddev_t *mddev)
/* may be over-ridden by personality */
mddev->resync_max_sectors = mddev->dev_sectors;

- mddev->barriers_work = 1;
mddev->ok_start_degraded = start_dirty_degraded;

if (start_readonly && mddev->ro == 0)
@@ -4685,7 +4619,6 @@ static void md_clean(mddev_t *mddev)
mddev->recovery = 0;
mddev->in_sync = 0;
mddev->degraded = 0;
- mddev->barriers_work = 0;
mddev->safemode = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a953fe2..d8e2ab2 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -87,7 +87,6 @@ struct mdk_rdev_s
#define Faulty 1 /* device is known to have a fault */
#define In_sync 2 /* device is in_sync with rest of array */
#define WriteMostly 4 /* Avoid reading if at all possible */
-#define BarriersNotsupp 5 /* REQ_HARDBARRIER is not supported */
#define AllReserved 6 /* If whole device is reserved for
* one array */
#define AutoDetected 7 /* added by auto-detect */
@@ -273,13 +272,6 @@ struct mddev_s
int degraded; /* whether md should consider
* adding a spare
*/
- int barriers_work; /* initialised to true, cleared as soon
- * as a barrier request to slave
- * fails. Only supported
- */
- struct bio *biolist; /* bios that need to be retried
- * because REQ_HARDBARRIER is not supported
- */

atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
@@ -339,16 +331,13 @@ struct mddev_s
struct attribute_group *to_remove;
struct plug_handle *plug; /* if used by personality */

- /* Generic barrier handling.
- * If there is a pending barrier request, all other
- * writes are blocked while the devices are flushed.
- * The last to finish a flush schedules a worker to
- * submit the barrier request (without the barrier flag),
- * then submit more flush requests.
+ /* Generic flush handling.
+ * The last to finish preflush schedules a worker to submit
+ * the rest of the request (without the REQ_FLUSH flag).
*/
- struct bio *barrier;
+ struct bio *flush_bio;
atomic_t flush_pending;
- struct work_struct barrier_work;
+ struct work_struct flush_work;
struct work_struct event_work; /* used by dm to report failure event */
};

@@ -502,7 +491,7 @@ extern void md_done_sync(mddev_t *mddev, int blocks, int ok);
extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);

extern int mddev_congested(mddev_t *mddev, int bits);
-extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
+extern void md_flush_request(mddev_t *mddev, struct bio *bio);
extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page);
extern void md_super_wait(mddev_t *mddev);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 0307d21..6d7ddf3 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -142,8 +142,8 @@ static int multipath_make_request(mddev_t *mddev, struct bio * bio)
struct multipath_bh * mp_bh;
struct multipath_info *multipath;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 6f7af46..a39f4c3 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -483,8 +483,8 @@ static int raid0_make_request(mddev_t *mddev, struct bio *bio)
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index ad83a4d..3f97bea 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -319,83 +319,74 @@ static void raid1_end_write_request(struct bio *bio, int error)
if (r1_bio->bios[mirror] == bio)
break;

- if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
- set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
- set_bit(R1BIO_BarrierRetry, &r1_bio->state);
- r1_bio->mddev->barriers_work = 0;
- /* Don't rdev_dec_pending in this branch - keep it for the retry */
- } else {
+ /*
+ * 'one mirror IO has finished' event handler:
+ */
+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
+ if (!uptodate) {
+ md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
+ /* an I/O failed, we can't clear the bitmap */
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ } else
/*
- * this branch is our 'one mirror IO has finished' event handler:
+ * Set R1BIO_Uptodate in our master bio, so that we
+ * will return a good error code for to the higher
+ * levels even if IO on some other mirrored buffer
+ * fails.
+ *
+ * The 'master' represents the composite IO operation
+ * to user-side. So if something waits for IO, then it
+ * will wait for the 'master' bio.
*/
- r1_bio->bios[mirror] = NULL;
- to_put = bio;
- if (!uptodate) {
- md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R1BIO_Degraded, &r1_bio->state);
- } else
- /*
- * Set R1BIO_Uptodate in our master bio, so that
- * we will return a good error code for to the higher
- * levels even if IO on some other mirrored buffer fails.
- *
- * The 'master' represents the composite IO operation to
- * user-side. So if something waits for IO, then it will
- * wait for the 'master' bio.
- */
- set_bit(R1BIO_Uptodate, &r1_bio->state);
-
- update_head_pos(mirror, r1_bio);
-
- if (behind) {
- if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
- atomic_dec(&r1_bio->behind_remaining);
-
- /* In behind mode, we ACK the master bio once the I/O has safely
- * reached all non-writemostly disks. Setting the Returned bit
- * ensures that this gets done only once -- we don't ever want to
- * return -EIO here, instead we'll wait */
-
- if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
- test_bit(R1BIO_Uptodate, &r1_bio->state)) {
- /* Maybe we can return now */
- if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
- struct bio *mbio = r1_bio->master_bio;
- PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
- (unsigned long long) mbio->bi_sector,
- (unsigned long long) mbio->bi_sector +
- (mbio->bi_size >> 9) - 1);
- bio_endio(mbio, 0);
- }
+ set_bit(R1BIO_Uptodate, &r1_bio->state);
+
+ update_head_pos(mirror, r1_bio);
+
+ if (behind) {
+ if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
+ atomic_dec(&r1_bio->behind_remaining);
+
+ /*
+ * In behind mode, we ACK the master bio once the I/O
+ * has safely reached all non-writemostly
+ * disks. Setting the Returned bit ensures that this
+ * gets done only once -- we don't ever want to return
+ * -EIO here, instead we'll wait
+ */
+ if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
+ test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+ /* Maybe we can return now */
+ if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+ struct bio *mbio = r1_bio->master_bio;
+ PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
+ (unsigned long long) mbio->bi_sector,
+ (unsigned long long) mbio->bi_sector +
+ (mbio->bi_size >> 9) - 1);
+ bio_endio(mbio, 0);
}
}
- rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
}
+ rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+
/*
- *
* Let's see if all mirrored write operations have finished
* already.
*/
if (atomic_dec_and_test(&r1_bio->remaining)) {
- if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
- reschedule_retry(r1_bio);
- else {
- /* it really is the end of this request */
- if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
- /* free extra copy of the data pages */
- int i = bio->bi_vcnt;
- while (i--)
- safe_put_page(bio->bi_io_vec[i].bv_page);
- }
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
- r1_bio->sectors,
- !test_bit(R1BIO_Degraded, &r1_bio->state),
- behind);
- md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ /* free extra copy of the data pages */
+ int i = bio->bi_vcnt;
+ while (i--)
+ safe_put_page(bio->bi_io_vec[i].bv_page);
}
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ !test_bit(R1BIO_Degraded, &r1_bio->state),
+ behind);
+ md_write_end(r1_bio->mddev);
+ raid_end_bio_io(r1_bio);
}

if (to_put)
@@ -788,6 +779,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
struct page **behind_pages = NULL;
const int rw = bio_data_dir(bio);
const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned long do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
unsigned long do_barriers;
mdk_rdev_t *blocked_rdev;

@@ -795,9 +787,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
* Register the new request and wait if the reconstruction
* thread has put up a bar for new requests.
* Continue immediately if no resync is active currently.
- * We test barriers_work *after* md_write_start as md_write_start
- * may cause the first superblock write, and that will check out
- * if barriers work.
*/

md_write_start(mddev, bio); /* wait on superblock update early */
@@ -821,13 +810,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
}
finish_wait(&conf->wait_barrier, &w);
}
- if (unlikely(!mddev->barriers_work &&
- (bio->bi_rw & REQ_HARDBARRIER))) {
- if (rw == WRITE)
- md_write_end(mddev);
- bio_endio(bio, -EOPNOTSUPP);
- return 0;
- }

wait_barrier(conf);

@@ -959,10 +941,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
atomic_set(&r1_bio->remaining, 0);
atomic_set(&r1_bio->behind_remaining, 0);

- do_barriers = bio->bi_rw & REQ_HARDBARRIER;
- if (do_barriers)
- set_bit(R1BIO_Barrier, &r1_bio->state);
-
bio_list_init(&bl);
for (i = 0; i < disks; i++) {
struct bio *mbio;
@@ -975,7 +953,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
mbio->bi_end_io = raid1_end_write_request;
- mbio->bi_rw = WRITE | do_barriers | do_sync;
+ mbio->bi_rw = WRITE | do_flush_fua | do_sync;
mbio->bi_private = r1_bio;

if (behind_pages) {
@@ -1634,41 +1612,6 @@ static void raid1d(mddev_t *mddev)
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
sync_request_write(mddev, r1_bio);
unplug = 1;
- } else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
- /* some requests in the r1bio were REQ_HARDBARRIER
- * requests which failed with -EOPNOTSUPP. Hohumm..
- * Better resubmit without the barrier.
- * We know which devices to resubmit for, because
- * all others have had their bios[] entry cleared.
- * We already have a nr_pending reference on these rdevs.
- */
- int i;
- const unsigned long do_sync = (r1_bio->master_bio->bi_rw & REQ_SYNC);
- clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
- clear_bit(R1BIO_Barrier, &r1_bio->state);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i])
- atomic_inc(&r1_bio->remaining);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i]) {
- struct bio_vec *bvec;
- int j;
-
- bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
- /* copy pages from the failed bio, as
- * this might be a write-behind device */
- __bio_for_each_segment(bvec, bio, j, 0)
- bvec->bv_page = bio_iovec_idx(r1_bio->bios[i], j)->bv_page;
- bio_put(r1_bio->bios[i]);
- bio->bi_sector = r1_bio->sector +
- conf->mirrors[i].rdev->data_offset;
- bio->bi_bdev = conf->mirrors[i].rdev->bdev;
- bio->bi_end_io = raid1_end_write_request;
- bio->bi_rw = WRITE | do_sync;
- bio->bi_private = r1_bio;
- r1_bio->bios[i] = bio;
- generic_make_request(bio);
- }
} else {
int disk;

diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 5f2d443..adf8cfd 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -117,8 +117,6 @@ struct r1bio_s {
#define R1BIO_IsSync 1
#define R1BIO_Degraded 2
#define R1BIO_BehindIO 3
-#define R1BIO_Barrier 4
-#define R1BIO_BarrierRetry 5
/* For write-behind requests, we call bi_end_io when
* the last non-write-behind device completes, providing
* any write was successful. Otherwise we call when
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 8471838..f0d082f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -800,12 +800,13 @@ static int make_request(mddev_t *mddev, struct bio * bio)
int chunk_sects = conf->chunk_mask + 1;
const int rw = bio_data_dir(bio);
const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned long do_fua = (bio->bi_rw & REQ_FUA);
struct bio_list bl;
unsigned long flags;
mdk_rdev_t *blocked_rdev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

@@ -965,7 +966,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
conf->mirrors[d].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
- mbio->bi_rw = WRITE | do_sync;
+ mbio->bi_rw = WRITE | do_sync | do_fua;
mbio->bi_private = r10_bio;

atomic_inc(&r10_bio->remaining);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 69b0a16..31140d1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -506,9 +506,12 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
int rw;
struct bio *bi;
mdk_rdev_t *rdev;
- if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
- rw = WRITE;
- else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+ if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
+ if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags))
+ rw = WRITE_FUA;
+ else
+ rw = WRITE;
+ } else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
rw = READ;
else
continue;
@@ -1031,6 +1034,8 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)

while (wbi && wbi->bi_sector <
dev->sector + STRIPE_SECTORS) {
+ if (wbi->bi_rw & REQ_FUA)
+ set_bit(R5_WantFUA, &dev->flags);
tx = async_copy_data(1, wbi, dev->page,
dev->sector, tx);
wbi = r5_next_bio(wbi, dev->sector);
@@ -1048,15 +1053,22 @@ static void ops_complete_reconstruct(void *stripe_head_ref)
int pd_idx = sh->pd_idx;
int qd_idx = sh->qd_idx;
int i;
+ bool fua = false;

pr_debug("%s: stripe %llu\n", __func__,
(unsigned long long)sh->sector);

+ for (i = disks; i--; )
+ fua |= test_bit(R5_WantFUA, &sh->dev[i].flags);
+
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];

- if (dev->written || i == pd_idx || i == qd_idx)
+ if (dev->written || i == pd_idx || i == qd_idx) {
set_bit(R5_UPTODATE, &dev->flags);
+ if (fua)
+ set_bit(R5_WantFUA, &dev->flags);
+ }
}

if (sh->reconstruct_state == reconstruct_state_drain_run)
@@ -3281,7 +3293,7 @@ static void handle_stripe5(struct stripe_head *sh)

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3583,7 +3595,7 @@ static void handle_stripe6(struct stripe_head *sh)

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3978,14 +3990,8 @@ static int make_request(mddev_t *mddev, struct bio * bi)
const int rw = bio_data_dir(bi);
int remaining;

- if (unlikely(bi->bi_rw & REQ_HARDBARRIER)) {
- /* Drain all pending writes. We only really need
- * to ensure they have been submitted, but this is
- * easier.
- */
- mddev->pers->quiesce(mddev, 1);
- mddev->pers->quiesce(mddev, 0);
- md_barrier_request(mddev, bi);
+ if (unlikely(bi->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bi);
return 0;
}

@@ -4103,7 +4109,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
finish_wait(&conf->wait_for_overlap, &w);
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
- if (mddev->barrier &&
+ if ((bi->bi_rw & REQ_SYNC) &&
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh);
@@ -4126,13 +4132,6 @@ static int make_request(mddev_t *mddev, struct bio * bi)
bio_endio(bi, 0);
}

- if (mddev->barrier) {
- /* We need to wait for the stripes to all be handled.
- * So: wait for preread_active_stripes to drop to 0.
- */
- wait_event(mddev->thread->wqueue,
- atomic_read(&conf->preread_active_stripes) == 0);
- }
return 0;
}

diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 36eaed5..2ace058 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -275,6 +275,7 @@ struct r6_state {
* filling
*/
#define R5_Wantdrain 13 /* dev->towrite needs to be drained */
+#define R5_WantFUA 14 /* Write should be FUA */
/*
* Write method
*/
--
1.7.1

2010-08-25 15:57:08

by Tejun Heo

Subject: [PATCH 14/30] block/loop: implement REQ_FLUSH/FUA support

Deprecate REQ_HARDBARRIER and implement REQ_FLUSH/FUA instead. Also,
instead of checking for file->f_op->fsync() directly, check the return
value of vfs_fsync() and ignore -EINVAL, which indicates that fsync
isn't supported.
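
In other words, REQ_FLUSH maps to a vfs_fsync() issued before the data
is written and REQ_FUA to one issued after it. Condensed from the hunk
below (a restatement, not additional code):

	if (bio->bi_rw & REQ_FLUSH) {
		ret = vfs_fsync(file, 0);	/* flush backing file first */
		if (ret && ret != -EINVAL)	/* -EINVAL: fsync unsupported */
			return -EIO;
	}
	ret = lo_send(lo, bio, pos);		/* the write itself */
	if ((bio->bi_rw & REQ_FUA) && !ret)
		ret = vfs_fsync(file, 0);	/* make this write durable */

Note that a bio only arrives here with REQ_FLUSH/FUA set if the queue
advertises the capability; otherwise __generic_make_request() filters
the flags out (see patch 11/30).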

Signed-off-by: Tejun Heo <[email protected]>
---
drivers/block/loop.c | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 953d1e1..5d27bc6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -477,17 +477,17 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;

if (bio_rw(bio) == WRITE) {
- bool barrier = (bio->bi_rw & REQ_HARDBARRIER);
struct file *file = lo->lo_backing_file;

- if (barrier) {
- if (unlikely(!file->f_op->fsync)) {
- ret = -EOPNOTSUPP;
- goto out;
- }
+ /* REQ_HARDBARRIER is deprecated */
+ if (bio->bi_rw & REQ_HARDBARRIER) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }

+ if (bio->bi_rw & REQ_FLUSH) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret)) {
+ if (unlikely(ret && ret != -EINVAL)) {
ret = -EIO;
goto out;
}
@@ -495,9 +495,9 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)

ret = lo_send(lo, bio, pos);

- if (barrier && !ret) {
+ if ((bio->bi_rw & REQ_FUA) && !ret) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret))
+ if (unlikely(ret && ret != -EINVAL))
ret = -EIO;
}
} else
--
1.7.1

2010-08-25 15:57:15

by Tejun Heo

Subject: [PATCH 10/30] block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests

Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags. REQ_FLUSH means the
device cache should be flushed before executing the request. REQ_FUA
means that the data in the request should be on non-volatile media on
completion.

The block layer will choose the correct way of implementing the
semantics and execute it. The request may be passed to the device
directly if the device can handle it; otherwise, it will be sequenced
using one or more proxy requests. Devices will never see REQ_FLUSH
and/or REQ_FUA flags that they don't support.
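
From a driver's point of view the interface reduces to declaring what
the device can do and letting the block layer sequence the rest. A
minimal sketch using the blk_queue_flush() helper (the call sites are
driver specific and not part of this hunk):

	/* device with a volatile write cache and FUA support */
	blk_queue_flush(q, REQ_FLUSH | REQ_FUA);

	/* device with a volatile write cache but no FUA */
	blk_queue_flush(q, REQ_FLUSH);

	/* write-through cache: nothing to declare, flushes are no-ops */
	blk_queue_flush(q, 0);

A submitter then simply asks for the semantics it needs, e.g.

	submit_bio(WRITE_FLUSH_FUA, bio);	/* preflush + FUA data write */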

* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
blk-flush.c.

* REQ_FLUSH w/o data could also be passed directly to drivers without
sequencing, but some drivers assume that zero length requests don't
have rq->bio, which isn't true for these requests; they therefore
require the use of proxy requests.

* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
copied from bio to request.

* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
WRITE_FLUSH_FUA are added.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-flush.c | 85 +++++++++++++++++++++++--------------------
block/blk.h | 3 ++
include/linux/blk_types.h | 2 +-
include/linux/blkdev.h | 38 +------------------
include/linux/buffer_head.h | 2 +-
include/linux/fs.h | 19 +++++++---
7 files changed, 67 insertions(+), 84 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8870ae4..18455c4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1204,7 +1204,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)

spin_lock_irq(q->queue_lock);

- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
where = ELEVATOR_INSERT_FRONT;
goto get_rq;
}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index dd87322..452c552 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,5 +1,5 @@
/*
- * Functions related to barrier IO handling
+ * Functions to sequence FLUSH and FUA writes.
*/
#include <linux/kernel.h>
#include <linux/module.h>
@@ -9,6 +9,15 @@

#include "blk.h"

+/* FLUSH/FUA sequences */
+enum {
+ QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
+ QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
+ QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
+ QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
+ QUEUE_FSEQ_DONE = (1 << 4),
+};
+
static struct request *queue_next_fseq(struct request_queue *q);

unsigned blk_flush_cur_seq(struct request_queue *q)
@@ -79,6 +88,7 @@ static void queue_flush(struct request_queue *q, struct request *rq,

static struct request *queue_next_fseq(struct request_queue *q)
{
+ struct request *orig_rq = q->orig_flush_rq;
struct request *rq = &q->flush_rq;

switch (blk_flush_cur_seq(q)) {
@@ -87,12 +97,11 @@ static struct request *queue_next_fseq(struct request_queue *q)
break;

case QUEUE_FSEQ_DATA:
- /* initialize proxy request and queue it */
+ /* initialize proxy request, inherit FLUSH/FUA and queue it */
blk_rq_init(q, rq);
- init_request_from_bio(rq, q->orig_flush_rq->bio);
- rq->cmd_flags &= ~REQ_HARDBARRIER;
- if (q->ordered & QUEUE_ORDERED_DO_FUA)
- rq->cmd_flags |= REQ_FUA;
+ init_request_from_bio(rq, orig_rq->bio);
+ rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
+ rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
rq->end_io = flush_data_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -110,60 +119,58 @@ static struct request *queue_next_fseq(struct request_queue *q)

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
+ unsigned int fflags = q->flush_flags; /* may change, cache it */
+ bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
+ bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
+ bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
unsigned skip = 0;

- if (!(rq->cmd_flags & REQ_HARDBARRIER))
+ /*
+ * Special case. If there's data but flush is not necessary,
+ * the request can be issued directly.
+ *
+ * Flush w/o data should be able to be issued directly too but
+ * currently some drivers assume that rq->bio contains
+ * non-zero data if it isn't NULL and empty FLUSH requests
+ * getting here usually have bio's without data.
+ */
+ if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
+ rq->cmd_flags &= ~REQ_FLUSH;
+ if (!has_fua)
+ rq->cmd_flags &= ~REQ_FUA;
return rq;
+ }

+ /*
+ * Sequenced flushes can't be processed in parallel. If
+ * another one is already in progress, queue for later
+ * processing.
+ */
if (q->flush_seq) {
- /*
- * Sequenced flush is already in progress and they
- * can't be processed in parallel. Queue for later
- * processing.
- */
list_move_tail(&rq->queuelist, &q->pending_flushes);
return NULL;
}

- if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
- /*
- * Queue ordering not supported. Terminate
- * with prejudice.
- */
- blk_dequeue_request(rq);
- __blk_end_request_all(rq, -EOPNOTSUPP);
- return NULL;
- }
-
/*
* Start a new flush sequence
*/
q->flush_err = 0;
- q->ordered = q->next_ordered;
q->flush_seq |= QUEUE_FSEQ_STARTED;

- /*
- * For an empty barrier, there's no actual BAR request, which
- * in turn makes POSTFLUSH unnecessary. Mask them off.
- */
- if (!blk_rq_sectors(rq))
- q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
- QUEUE_ORDERED_DO_POSTFLUSH);
-
- /* stash away the original request */
+ /* adjust FLUSH/FUA of the original request and stash it away */
+ rq->cmd_flags &= ~REQ_FLUSH;
+ if (!has_fua)
+ rq->cmd_flags &= ~REQ_FUA;
blk_dequeue_request(rq);
q->orig_flush_rq = rq;

- if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+ /* skip unneeded sequences and return the first one */
+ if (!do_preflush)
skip |= QUEUE_FSEQ_PREFLUSH;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+ if (!blk_rq_sectors(rq))
skip |= QUEUE_FSEQ_DATA;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+ if (!do_postflush)
skip |= QUEUE_FSEQ_POSTFLUSH;
-
- /* complete skipped sequences and return the first sequence */
return blk_flush_complete_seq(q, skip, 0);
}

diff --git a/block/blk.h b/block/blk.h
index 24b92bd..a09c18b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -60,6 +60,9 @@ static inline struct request *__elv_next_request(struct request_queue *q)
while (1) {
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
+ if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
+ rq == &q->flush_rq)
+ return rq;
rq = blk_do_flush(q, rq);
if (rq)
return rq;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9192282..1797994 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -167,7 +167,7 @@ enum rq_flag_bits {
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
#define REQ_COMMON_MASK \
(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
- REQ_META| REQ_DISCARD | REQ_NOIDLE)
+ REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)

#define REQ_UNPLUG (1 << __REQ_UNPLUG)
#define REQ_RAHEAD (1 << __REQ_RAHEAD)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1cd83ec..8ef705f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -357,7 +357,6 @@ struct request_queue
/*
* for flush operations
*/
- unsigned int ordered, next_ordered;
unsigned int flush_flags;
unsigned int flush_seq;
int flush_err;
@@ -465,40 +464,6 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
__clear_bit(flag, &q->queue_flags);
}

-enum {
- /*
- * Hardbarrier is supported with one of the following methods.
- *
- * NONE : hardbarrier unsupported
- * DRAIN : ordering by draining is enough
- * DRAIN_FLUSH : ordering by draining w/ pre and post flushes
- * DRAIN_FUA : ordering by draining w/ pre flush and FUA write
- */
- QUEUE_ORDERED_DO_PREFLUSH = 0x10,
- QUEUE_ORDERED_DO_BAR = 0x20,
- QUEUE_ORDERED_DO_POSTFLUSH = 0x40,
- QUEUE_ORDERED_DO_FUA = 0x80,
-
- QUEUE_ORDERED_NONE = 0x00,
-
- QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_DO_BAR,
- QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_POSTFLUSH,
- QUEUE_ORDERED_DRAIN_FUA = QUEUE_ORDERED_DRAIN |
- QUEUE_ORDERED_DO_PREFLUSH |
- QUEUE_ORDERED_DO_FUA,
-
- /*
- * FLUSH/FUA sequences.
- */
- QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
- QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
- QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
- QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
- QUEUE_FSEQ_DONE = (1 << 4),
-};
-
#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
#define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
@@ -578,7 +543,8 @@ static inline void blk_clear_queue_full(struct request_queue *q, int sync)
* it already be started by driver.
*/
#define RQ_NOMERGE_FLAGS \
- (REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+ (REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | \
+ REQ_FLUSH | REQ_FUA)
#define rq_mergeable(rq) \
(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
(((rq)->cmd_flags & REQ_DISCARD) || \
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index ec94c12..fc999f5 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -32,7 +32,7 @@ enum bh_state_bits {
BH_Delay, /* Buffer is not yet allocated on disk */
BH_Boundary, /* Block is followed by a discontiguity */
BH_Write_EIO, /* I/O error on write */
- BH_Eopnotsupp, /* operation not supported (barrier) */
+ BH_Eopnotsupp, /* DEPRECATED: operation not supported (barrier) */
BH_Unwritten, /* Buffer is allocated on disk but not written */
BH_Quiet, /* Buffer Error Prinks to be quiet */

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76041b6..352c486 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -135,12 +135,13 @@ struct inodes_stat_t {
* immediately after submission. The write equivalent
* of READ_SYNC.
* WRITE_ODIRECT_PLUG Special case write for O_DIRECT only.
- * WRITE_BARRIER Like WRITE_SYNC, but tells the block layer that all
- * previously submitted writes must be safely on storage
- * before this one is started. Also guarantees that when
- * this write is complete, it itself is also safely on
- * storage. Prevents reordering of writes on both sides
- * of this IO.
+ * WRITE_BARRIER DEPRECATED. Always fails. Use FLUSH/FUA instead.
+ * WRITE_FLUSH Like WRITE_SYNC but with preceding cache flush.
+ * WRITE_FUA Like WRITE_SYNC but data is guaranteed to be on
+ * non-volatile media on completion.
+ * WRITE_FLUSH_FUA Combination of WRITE_FLUSH and FUA. The IO is preceded
+ * by a cache flush and data is guaranteed to be on
+ * non-volatile media on completion.
*
*/
#define RW_MASK REQ_WRITE
@@ -158,6 +159,12 @@ struct inodes_stat_t {
#define WRITE_META (WRITE | REQ_META)
#define WRITE_BARRIER (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
REQ_HARDBARRIER)
+#define WRITE_FLUSH (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FLUSH)
+#define WRITE_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FUA)
+#define WRITE_FLUSH_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+ REQ_FLUSH | REQ_FUA)

/*
* These aren't really reads or writes, they pass down information about
--
1.7.1

2010-08-25 15:58:16

by Tejun Heo

Subject: [PATCH 11/30] block: filter flush bio's in __generic_make_request()

There are a number of make_request based drivers which don't support
cache flushes. Filter out flush bio's in __generic_make_request() so
that they don't have to worry about them. All FLUSH/FUA requests with
data are converted to regular IO requests and empty ones are completed
immediately.
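
A make_request based driver without flush support can thus be written
as if the flags didn't exist. A sketch -- mydrv_make_request() and
mydrv_do_io() are hypothetical names, not part of the patch:

	static int mydrv_make_request(struct request_queue *q, struct bio *bio)
	{
		/*
		 * This queue never called blk_queue_flush(), so
		 * q->flush_flags == 0 and __generic_make_request() has
		 * already stripped REQ_FLUSH/REQ_FUA from bio->bi_rw
		 * (and completed empty flush bio's outright).
		 */
		WARN_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));

		return mydrv_do_io(q, bio);	/* hypothetical worker */
	}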

Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-core.c | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18455c4..495bdc4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1509,6 +1509,19 @@ static inline void __generic_make_request(struct bio *bio)
if (bio_check_eod(bio, nr_sectors))
goto end_io;

+ /*
+ * Filter flush bio's early so that make_request based
+ * drivers without flush support don't have to worry
+ * about them.
+ */
+ if ((bio->bi_rw & (REQ_FLUSH | REQ_FUA)) && !q->flush_flags) {
+ bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
+ if (!nr_sectors) {
+ err = 0;
+ goto end_io;
+ }
+ }
+
if ((bio->bi_rw & REQ_DISCARD) &&
(!blk_queue_discard(q) ||
((bio->bi_rw & REQ_SECURE) &&
--
1.7.1

2010-08-25 15:58:14

by Tejun Heo

Subject: [PATCH 06/30] block: misc cleanups in barrier code

Make the following cleanups in preparation for the barrier/flush update.

* blk_do_ordered() declaration is moved from include/linux/blkdev.h to
block/blk.h.

* blk_do_ordered() now returns pointer to struct request, with %NULL
meaning "try the next request" and ERR_PTR(-EAGAIN) "try again
later". The third case will be dropped with further changes.

* In the initialization of proxy barrier request, data direction is
already set by init_request_from_bio(). Drop unnecessary explicit
REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
flag setting.

* add_request() is collapsed into __make_request().

These changes don't make any functional difference.
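
The resulting convention at the call site is a tri-state. Condensed
from the __elv_next_request() hunk below, with the three cases
annotated:

	rq = blk_do_ordered(q, rq);
	if (rq)
		/* a dispatchable request, or ERR_PTR(-EAGAIN) which
		 * makes __elv_next_request() return NULL so that the
		 * caller retries later */
		return !IS_ERR(rq) ? rq : NULL;
	/* NULL: the request was consumed; look at the next one */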

Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-barrier.c | 32 ++++++++++++++------------------
block/blk-core.c | 21 ++++-----------------
block/blk.h | 7 +++++--
include/linux/blkdev.h | 1 -
4 files changed, 23 insertions(+), 38 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index ed0aba5..f1be85b 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -110,9 +110,9 @@ static void queue_flush(struct request_queue *q, unsigned which)
elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
}

-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+static inline struct request *start_ordered(struct request_queue *q,
+ struct request *rq)
{
- struct request *rq = *rqp;
unsigned skip = 0;

q->orderr = 0;
@@ -149,11 +149,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)

/* initialize proxy request and queue it */
blk_rq_init(q, rq);
- if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
- rq->cmd_flags |= REQ_WRITE;
+ init_request_from_bio(rq, q->orig_bar_rq->bio);
if (q->ordered & QUEUE_ORDERED_DO_FUA)
rq->cmd_flags |= REQ_FUA;
- init_request_from_bio(rq, q->orig_bar_rq->bio);
rq->end_io = bar_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -171,27 +169,26 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
else
skip |= QUEUE_ORDSEQ_DRAIN;

- *rqp = rq;
-
/*
* Complete skipped sequences. If whole sequence is complete,
- * return false to tell elevator that this request is gone.
+ * return %NULL to tell elevator that this request is gone.
*/
- return !blk_ordered_complete_seq(q, skip, 0);
+ if (blk_ordered_complete_seq(q, skip, 0))
+ rq = NULL;
+ return rq;
}

-bool blk_do_ordered(struct request_queue *q, struct request **rqp)
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
{
- struct request *rq = *rqp;
const int is_barrier = rq->cmd_type == REQ_TYPE_FS &&
(rq->cmd_flags & REQ_HARDBARRIER);

if (!q->ordseq) {
if (!is_barrier)
- return true;
+ return rq;

if (q->next_ordered != QUEUE_ORDERED_NONE)
- return start_ordered(q, rqp);
+ return start_ordered(q, rq);
else {
/*
* Queue ordering not supported. Terminate
@@ -199,8 +196,7 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
*/
blk_dequeue_request(rq);
__blk_end_request_all(rq, -EOPNOTSUPP);
- *rqp = NULL;
- return false;
+ return NULL;
}
}

@@ -211,14 +207,14 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
/* Special requests are not subject to ordering rules. */
if (rq->cmd_type != REQ_TYPE_FS &&
rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
- return true;
+ return rq;

/* Ordered by draining. Wait for turn. */
WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
- *rqp = NULL;
+ rq = ERR_PTR(-EAGAIN);

- return true;
+ return rq;
}

static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk-core.c b/block/blk-core.c
index f063541..f8d37a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1037,22 +1037,6 @@ void blk_insert_request(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL(blk_insert_request);

-/*
- * add-request adds a request to the linked list.
- * queue lock is held and interrupts disabled, as we muck with the
- * request queue list.
- */
-static inline void add_request(struct request_queue *q, struct request *req)
-{
- drive_stat_acct(req, 1);
-
- /*
- * elevator indicated where it wants this request to be
- * inserted at elevator_merge time
- */
- __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
-}
-
static void part_round_stats_single(int cpu, struct hd_struct *part,
unsigned long now)
{
@@ -1316,7 +1300,10 @@ get_rq:
req->cpu = blk_cpu_to_group(smp_processor_id());
if (queue_should_plug(q) && elv_queue_empty(q))
blk_plug_device(q);
- add_request(q, req);
+
+ /* insert the request into the elevator */
+ drive_stat_acct(req, 1);
+ __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
out:
if (unplug || !queue_should_plug(q))
__generic_unplug_device(q);
diff --git a/block/blk.h b/block/blk.h
index 6e7dc87..874eb4e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete(struct request *rq)
*/
#define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash))

+struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
@@ -58,8 +60,9 @@ static inline struct request *__elv_next_request(struct request_queue *q)
while (1) {
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (blk_do_ordered(q, &rq))
- return rq;
+ rq = blk_do_ordered(q, rq);
+ if (rq)
+ return !IS_ERR(rq) ? rq : NULL;
}

if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e97911d..996549d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -869,7 +869,6 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern bool blk_do_ordered(struct request_queue *, struct request **);
extern unsigned blk_ordered_cur_seq(struct request_queue *);
extern unsigned blk_ordered_req_seq(struct request *);
extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);
--
1.7.1

2010-08-25 15:59:19

by Tejun Heo

Subject: [PATCH 07/30] block: drop barrier ordering by queue draining

Filesystems will take all the responsibility for ordering requests
around commit writes and will only indicate how the commit writes
themselves should be handled by the block layer. This patch drops
barrier ordering by queue draining from the block layer. The
ordering-by-draining implementation was somewhat invasive to request
handling. A list of notable changes follows.

* Each queue had a one-bit color which was flipped on each barrier
issue and used to track whether a given request was issued before the
current barrier. The REQ_ORDERED_COLOR flag and the coloring
implementation in __elv_add_request() are removed.

* Requests which shouldn't be processed yet for draining were stalled
by returning -EAGAIN from blk_do_ordered() based on a comparison of
blk_ordered_req_seq() and blk_ordered_cur_seq(). This logic is
removed.

* Draining completion logic in elv_completed_request() removed.

* All barrier sequence requests were queued to the request queue and
then trickled to the lower layer according to progress, and thus
maintaining request order during requeue was necessary. This is
replaced by queueing the next request in the barrier sequence only
after the current one is complete from blk_ordered_complete_seq(),
which removes the need for multiple proxy requests in struct
request_queue and the request sorting logic in the
ELEVATOR_INSERT_REQUEUE path of elv_insert().

* As barriers no longer have ordering constraints, there's no need to
dump the whole elevator onto the dispatch queue on each barrier.
Insert barriers at the front instead.

* If other barrier requests come to the front of the dispatch queue
while one is already in progress, they are stored in
q->pending_barriers and restored to dispatch queue one-by-one after
each barrier completion from blk_ordered_complete_seq().
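
Taken together, the sequence is now driven entirely from completions
rather than by draining. Condensed from queue_next_ordseq() and
blk_ordered_complete_seq() in the diff below:

	/* one step at a time; rq is the embedded q->bar_rq proxy */
	switch (blk_ordered_cur_seq(q)) {
	case QUEUE_ORDSEQ_PREFLUSH:
		queue_flush(q, rq, pre_flush_end_io);	/* REQ_FLUSH proxy */
		break;
	case QUEUE_ORDSEQ_BAR:
		/* re-issue the original data, minus the barrier flag */
		blk_rq_init(q, rq);
		init_request_from_bio(rq, q->orig_bar_rq->bio);
		rq->cmd_flags &= ~REQ_HARDBARRIER;
		rq->end_io = bar_end_io;
		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
		break;
	case QUEUE_ORDSEQ_POSTFLUSH:
		queue_flush(q, rq, post_flush_end_io);	/* REQ_FLUSH proxy */
		break;
	}
	/*
	 * Each step's end_io calls blk_ordered_complete_seq(), which
	 * either queues the next step above or completes orig_bar_rq
	 * and pulls the next barrier off q->pending_barriers.
	 */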

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
block/blk-barrier.c | 220 ++++++++++++++++++---------------------------
block/blk-core.c | 11 ++-
block/blk.h | 2 +-
block/elevator.c | 79 ++--------------
include/linux/blk_types.h | 2 -
include/linux/blkdev.h | 19 ++---
6 files changed, 113 insertions(+), 220 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f1be85b..e8b2e5c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -9,6 +9,8 @@

#include "blk.h"

+static struct request *queue_next_ordseq(struct request_queue *q);
+
/*
* Cache flushing for ordered writes handling
*/
@@ -19,38 +21,10 @@ unsigned blk_ordered_cur_seq(struct request_queue *q)
return 1 << ffz(q->ordseq);
}

-unsigned blk_ordered_req_seq(struct request *rq)
-{
- struct request_queue *q = rq->q;
-
- BUG_ON(q->ordseq == 0);
-
- if (rq == &q->pre_flush_rq)
- return QUEUE_ORDSEQ_PREFLUSH;
- if (rq == &q->bar_rq)
- return QUEUE_ORDSEQ_BAR;
- if (rq == &q->post_flush_rq)
- return QUEUE_ORDSEQ_POSTFLUSH;
-
- /*
- * !fs requests don't need to follow barrier ordering. Always
- * put them at the front. This fixes the following deadlock.
- *
- * http://thread.gmane.org/gmane.linux.kernel/537473
- */
- if (rq->cmd_type != REQ_TYPE_FS)
- return QUEUE_ORDSEQ_DRAIN;
-
- if ((rq->cmd_flags & REQ_ORDERED_COLOR) ==
- (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR))
- return QUEUE_ORDSEQ_DRAIN;
- else
- return QUEUE_ORDSEQ_DONE;
-}
-
-bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+ unsigned seq, int error)
{
- struct request *rq;
+ struct request *next_rq = NULL;

if (error && !q->orderr)
q->orderr = error;
@@ -58,16 +32,22 @@ bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
BUG_ON(q->ordseq & seq);
q->ordseq |= seq;

- if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE)
- return false;
-
- /*
- * Okay, sequence complete.
- */
- q->ordseq = 0;
- rq = q->orig_bar_rq;
- __blk_end_request_all(rq, q->orderr);
- return true;
+ if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+ /* not complete yet, queue the next ordered sequence */
+ next_rq = queue_next_ordseq(q);
+ } else {
+ /* complete this barrier request */
+ __blk_end_request_all(q->orig_bar_rq, q->orderr);
+ q->orig_bar_rq = NULL;
+ q->ordseq = 0;
+
+ /* dispatch the next barrier if there's one */
+ if (!list_empty(&q->pending_barriers)) {
+ next_rq = list_entry_rq(q->pending_barriers.next);
+ list_move(&next_rq->queuelist, &q->queue_head);
+ }
+ }
+ return next_rq;
}

static void pre_flush_end_io(struct request *rq, int error)
@@ -88,133 +68,105 @@ static void post_flush_end_io(struct request *rq, int error)
blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
}

-static void queue_flush(struct request_queue *q, unsigned which)
+static void queue_flush(struct request_queue *q, struct request *rq,
+ rq_end_io_fn *end_io)
{
- struct request *rq;
- rq_end_io_fn *end_io;
-
- if (which == QUEUE_ORDERED_DO_PREFLUSH) {
- rq = &q->pre_flush_rq;
- end_io = pre_flush_end_io;
- } else {
- rq = &q->post_flush_rq;
- end_io = post_flush_end_io;
- }
-
blk_rq_init(q, rq);
rq->cmd_type = REQ_TYPE_FS;
- rq->cmd_flags = REQ_HARDBARRIER | REQ_FLUSH;
+ rq->cmd_flags = REQ_FLUSH;
rq->rq_disk = q->orig_bar_rq->rq_disk;
rq->end_io = end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
}

-static inline struct request *start_ordered(struct request_queue *q,
- struct request *rq)
+static struct request *queue_next_ordseq(struct request_queue *q)
{
- unsigned skip = 0;
-
- q->orderr = 0;
- q->ordered = q->next_ordered;
- q->ordseq |= QUEUE_ORDSEQ_STARTED;
-
- /*
- * For an empty barrier, there's no actual BAR request, which
- * in turn makes POSTFLUSH unnecessary. Mask them off.
- */
- if (!blk_rq_sectors(rq))
- q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
- QUEUE_ORDERED_DO_POSTFLUSH);
-
- /* stash away the original request */
- blk_dequeue_request(rq);
- q->orig_bar_rq = rq;
- rq = NULL;
-
- /*
- * Queue ordered sequence. As we stack them at the head, we
- * need to queue in reverse order. Note that we rely on that
- * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs
- * request gets inbetween ordered sequence.
- */
- if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) {
- queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH);
- rq = &q->post_flush_rq;
- } else
- skip |= QUEUE_ORDSEQ_POSTFLUSH;
+ struct request *rq = &q->bar_rq;

- if (q->ordered & QUEUE_ORDERED_DO_BAR) {
- rq = &q->bar_rq;
+ switch (blk_ordered_cur_seq(q)) {
+ case QUEUE_ORDSEQ_PREFLUSH:
+ queue_flush(q, rq, pre_flush_end_io);
+ break;

+ case QUEUE_ORDSEQ_BAR:
/* initialize proxy request and queue it */
blk_rq_init(q, rq);
init_request_from_bio(rq, q->orig_bar_rq->bio);
+ rq->cmd_flags &= ~REQ_HARDBARRIER;
if (q->ordered & QUEUE_ORDERED_DO_FUA)
rq->cmd_flags |= REQ_FUA;
rq->end_io = bar_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
- } else
- skip |= QUEUE_ORDSEQ_BAR;
+ break;

- if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) {
- queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH);
- rq = &q->pre_flush_rq;
- } else
- skip |= QUEUE_ORDSEQ_PREFLUSH;
+ case QUEUE_ORDSEQ_POSTFLUSH:
+ queue_flush(q, rq, post_flush_end_io);
+ break;

- if (queue_in_flight(q))
- rq = NULL;
- else
- skip |= QUEUE_ORDSEQ_DRAIN;
-
- /*
- * Complete skipped sequences. If whole sequence is complete,
- * return %NULL to tell elevator that this request is gone.
- */
- if (blk_ordered_complete_seq(q, skip, 0))
- rq = NULL;
+ default:
+ BUG();
+ }
return rq;
}

struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
{
- const int is_barrier = rq->cmd_type == REQ_TYPE_FS &&
- (rq->cmd_flags & REQ_HARDBARRIER);
-
- if (!q->ordseq) {
- if (!is_barrier)
- return rq;
-
- if (q->next_ordered != QUEUE_ORDERED_NONE)
- return start_ordered(q, rq);
- else {
- /*
- * Queue ordering not supported. Terminate
- * with prejudice.
- */
- blk_dequeue_request(rq);
- __blk_end_request_all(rq, -EOPNOTSUPP);
- return NULL;
- }
+ unsigned skip = 0;
+
+ if (!(rq->cmd_flags & REQ_HARDBARRIER))
+ return rq;
+
+ if (q->ordseq) {
+ /*
+ * Barrier is already in progress and they can't be
+ * processed in parallel. Queue for later processing.
+ */
+ list_move_tail(&rq->queuelist, &q->pending_barriers);
+ return NULL;
+ }
+
+ if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
+ /*
+ * Queue ordering not supported. Terminate
+ * with prejudice.
+ */
+ blk_dequeue_request(rq);
+ __blk_end_request_all(rq, -EOPNOTSUPP);
+ return NULL;
}

/*
- * Ordered sequence in progress
+ * Start a new ordered sequence
*/
+ q->orderr = 0;
+ q->ordered = q->next_ordered;
+ q->ordseq |= QUEUE_ORDSEQ_STARTED;

- /* Special requests are not subject to ordering rules. */
- if (rq->cmd_type != REQ_TYPE_FS &&
- rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
- return rq;
+ /*
+ * For an empty barrier, there's no actual BAR request, which
+ * in turn makes POSTFLUSH unnecessary. Mask them off.
+ */
+ if (!blk_rq_sectors(rq))
+ q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
+ QUEUE_ORDERED_DO_POSTFLUSH);

- /* Ordered by draining. Wait for turn. */
- WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
- if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
- rq = ERR_PTR(-EAGAIN);
+ /* stash away the original request */
+ blk_dequeue_request(rq);
+ q->orig_bar_rq = rq;

- return rq;
+ if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+ skip |= QUEUE_ORDSEQ_PREFLUSH;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+ skip |= QUEUE_ORDSEQ_BAR;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+ skip |= QUEUE_ORDSEQ_POSTFLUSH;
+
+ /* complete skipped sequences and return the first sequence */
+ return blk_ordered_complete_seq(q, skip, 0);
}

static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk-core.c b/block/blk-core.c
index f8d37a8..d316662 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
init_timer(&q->unplug_timer);
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
INIT_LIST_HEAD(&q->timeout_list);
+ INIT_LIST_HEAD(&q->pending_barriers);
INIT_WORK(&q->unplug_work, blk_unplug_work);

kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1185,6 +1186,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
const bool sync = (bio->bi_rw & REQ_SYNC);
const bool unplug = (bio->bi_rw & REQ_UNPLUG);
const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
+ int where = ELEVATOR_INSERT_SORT;
int rw_flags;

/* REQ_HARDBARRIER is no more */
@@ -1203,7 +1205,12 @@ static int __make_request(struct request_queue *q, struct bio *bio)

spin_lock_irq(q->queue_lock);

- if (unlikely((bio->bi_rw & REQ_HARDBARRIER)) || elv_queue_empty(q))
+ if (bio->bi_rw & REQ_HARDBARRIER) {
+ where = ELEVATOR_INSERT_FRONT;
+ goto get_rq;
+ }
+
+ if (elv_queue_empty(q))
goto get_rq;

el_ret = elv_merge(q, &req, bio);
@@ -1303,7 +1310,7 @@ get_rq:

/* insert the request into the elevator */
drive_stat_acct(req, 1);
- __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
+ __elv_add_request(q, req, where, 0);
out:
if (unplug || !queue_should_plug(q))
__generic_unplug_device(q);
diff --git a/block/blk.h b/block/blk.h
index 874eb4e..08081e4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -62,7 +62,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
rq = list_entry_rq(q->queue_head.next);
rq = blk_do_ordered(q, rq);
if (rq)
- return !IS_ERR(rq) ? rq : NULL;
+ return rq;
}

if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/block/elevator.c b/block/elevator.c
index ec585c9..241c69c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -617,8 +617,6 @@ void elv_quiesce_end(struct request_queue *q)

void elv_insert(struct request_queue *q, struct request *rq, int where)
{
- struct list_head *pos;
- unsigned ordseq;
int unplug_it = 1;

trace_block_rq_insert(q, rq);
@@ -626,9 +624,16 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
rq->q = q;

switch (where) {
+ case ELEVATOR_INSERT_REQUEUE:
+ /*
+ * Most requeues happen because of a busy condition,
+ * don't force unplug of the queue for that case.
+ * Clear unplug_it and fall through.
+ */
+ unplug_it = 0;
+
case ELEVATOR_INSERT_FRONT:
rq->cmd_flags |= REQ_SOFTBARRIER;
-
list_add(&rq->queuelist, &q->queue_head);
break;

@@ -668,36 +673,6 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
q->elevator->ops->elevator_add_req_fn(q, rq);
break;

- case ELEVATOR_INSERT_REQUEUE:
- /*
- * If ordered flush isn't in progress, we do front
- * insertion; otherwise, requests should be requeued
- * in ordseq order.
- */
- rq->cmd_flags |= REQ_SOFTBARRIER;
-
- /*
- * Most requeues happen because of a busy condition,
- * don't force unplug of the queue for that case.
- */
- unplug_it = 0;
-
- if (q->ordseq == 0) {
- list_add(&rq->queuelist, &q->queue_head);
- break;
- }
-
- ordseq = blk_ordered_req_seq(rq);
-
- list_for_each(pos, &q->queue_head) {
- struct request *pos_rq = list_entry_rq(pos);
- if (ordseq <= blk_ordered_req_seq(pos_rq))
- break;
- }
-
- list_add_tail(&rq->queuelist, pos);
- break;
-
default:
printk(KERN_ERR "%s: bad insertion point %d\n",
__func__, where);
@@ -716,26 +691,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
void __elv_add_request(struct request_queue *q, struct request *rq, int where,
int plug)
{
- if (q->ordcolor)
- rq->cmd_flags |= REQ_ORDERED_COLOR;
-
if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
- /*
- * toggle ordered color
- */
- if (rq->cmd_flags & REQ_HARDBARRIER)
- q->ordcolor ^= 1;
-
- /*
- * barriers implicitly indicate back insertion
- */
- if (where == ELEVATOR_INSERT_SORT)
- where = ELEVATOR_INSERT_BACK;
-
- /*
- * this request is scheduling boundary, update
- * end_sector
- */
+ /* barriers are scheduling boundary, update end_sector */
if (rq->cmd_type == REQ_TYPE_FS ||
(rq->cmd_flags & REQ_DISCARD)) {
q->end_sector = rq_end_sector(rq);
@@ -855,24 +812,6 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
e->ops->elevator_completed_req_fn)
e->ops->elevator_completed_req_fn(q, rq);
}
-
- /*
- * Check if the queue is waiting for fs requests to be
- * drained for flush sequence.
- */
- if (unlikely(q->ordseq)) {
- struct request *next = NULL;
-
- if (!list_empty(&q->queue_head))
- next = list_entry_rq(q->queue_head.next);
-
- if (!queue_in_flight(q) &&
- blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN &&
- (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) {
- blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0);
- __blk_run_queue(q);
- }
- }
}

#define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ca83a97..9192282 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -143,7 +143,6 @@ enum rq_flag_bits {
__REQ_FAILED, /* set if the request failed */
__REQ_QUIET, /* don't worry about errors */
__REQ_PREEMPT, /* set for "ide_preempt" requests */
- __REQ_ORDERED_COLOR, /* is before or after barrier */
__REQ_ALLOCED, /* request came from our alloc pool */
__REQ_COPY_USER, /* contains copies of user pages */
__REQ_INTEGRITY, /* integrity metadata has been remapped */
@@ -184,7 +183,6 @@ enum rq_flag_bits {
#define REQ_FAILED (1 << __REQ_FAILED)
#define REQ_QUIET (1 << __REQ_QUIET)
#define REQ_PREEMPT (1 << __REQ_PREEMPT)
-#define REQ_ORDERED_COLOR (1 << __REQ_ORDERED_COLOR)
#define REQ_ALLOCED (1 << __REQ_ALLOCED)
#define REQ_COPY_USER (1 << __REQ_COPY_USER)
#define REQ_INTEGRITY (1 << __REQ_INTEGRITY)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 996549d..20a3710 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -360,9 +360,10 @@ struct request_queue
unsigned int flush_flags;

unsigned int ordered, next_ordered, ordseq;
- int orderr, ordcolor;
- struct request pre_flush_rq, bar_rq, post_flush_rq;
+ int orderr;
+ struct request bar_rq;
struct request *orig_bar_rq;
+ struct list_head pending_barriers;

struct mutex sysfs_lock;

@@ -491,12 +492,11 @@ enum {
/*
* Ordered operation sequence
*/
- QUEUE_ORDSEQ_STARTED = 0x01, /* flushing in progress */
- QUEUE_ORDSEQ_DRAIN = 0x02, /* waiting for the queue to be drained */
- QUEUE_ORDSEQ_PREFLUSH = 0x04, /* pre-flushing in progress */
- QUEUE_ORDSEQ_BAR = 0x08, /* original barrier req in progress */
- QUEUE_ORDSEQ_POSTFLUSH = 0x10, /* post-flushing in progress */
- QUEUE_ORDSEQ_DONE = 0x20,
+ QUEUE_ORDSEQ_STARTED = (1 << 0), /* flushing in progress */
+ QUEUE_ORDSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
+ QUEUE_ORDSEQ_BAR = (1 << 2), /* barrier write in progress */
+ QUEUE_ORDSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
+ QUEUE_ORDSEQ_DONE = (1 << 4),
};

#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
@@ -869,9 +869,6 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern unsigned blk_ordered_cur_seq(struct request_queue *);
-extern unsigned blk_ordered_req_seq(struct request *);
-extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);

extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
extern void blk_dump_rq_flags(struct request *, char *);
--
1.7.1

2010-08-25 15:59:22

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 29/30] block: remove the BLKDEV_IFL_BARRIER flag

From: Christoph Hellwig <[email protected]>

Remove support for barriers on discards, which is unused now. Also
remove the DISCARD_NOBARRIER I/O type in favour of just setting the
rw flags up locally in blkdev_issue_discard.

tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
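For reference, a minimal caller-side sketch of the post-patch interface
(signature and flags as used elsewhere in this series; the helper name is
made up for illustration):

  #include <linux/blkdev.h>

  /* sketch: synchronous discard - there is no barrier flag to pass anymore */
  static int discard_extent(struct block_device *bdev, sector_t start,
                            sector_t nr_sects)
  {
          return blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS,
                                      BLKDEV_IFL_WAIT);
  }
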
block/blk-lib.c | 18 ++----------------
include/linux/blkdev.h | 2 --
include/linux/fs.h | 8 --------
3 files changed, 2 insertions(+), 26 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index c392029..fe2e6ed 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -39,8 +39,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
{
DECLARE_COMPLETION_ONSTACK(wait);
struct request_queue *q = bdev_get_queue(bdev);
- int type = flags & BLKDEV_IFL_BARRIER ?
- DISCARD_BARRIER : DISCARD_NOBARRIER;
+ int type = REQ_WRITE | REQ_DISCARD;
unsigned int max_discard_sectors;
struct bio *bio;
int ret = 0;
@@ -65,7 +64,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
if (flags & BLKDEV_IFL_SECURE) {
if (!blk_queue_secdiscard(q))
return -EOPNOTSUPP;
- type |= DISCARD_SECURE;
+ type |= REQ_SECURE;
}

while (nr_sects && !ret) {
@@ -162,12 +161,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
bb.wait = &wait;
bb.end_io = NULL;

- if (flags & BLKDEV_IFL_BARRIER) {
- /* issue async barrier before the data */
- ret = blkdev_issue_flush(bdev, gfp_mask, NULL, 0);
- if (ret)
- return ret;
- }
submit:
ret = 0;
while (nr_sects != 0) {
@@ -199,13 +192,6 @@ submit:
issued++;
submit_bio(WRITE, bio);
}
- /*
- * When all data bios are in flight. Send final barrier if requeted.
- */
- if (nr_sects == 0 && flags & BLKDEV_IFL_BARRIER)
- ret = blkdev_issue_flush(bdev, gfp_mask, NULL,
- flags & BLKDEV_IFL_WAIT);
-

if (flags & BLKDEV_IFL_WAIT)
/* Wait for bios in-flight */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6b305eb..cfcb3a6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -869,11 +869,9 @@ static inline struct request *blk_map_queue_find_tag(struct blk_queue_tag *bqt,
}
enum{
BLKDEV_WAIT, /* wait for completion */
- BLKDEV_BARRIER, /* issue request with barrier */
BLKDEV_SECURE, /* secure discard */
};
#define BLKDEV_IFL_WAIT (1 << BLKDEV_WAIT)
-#define BLKDEV_IFL_BARRIER (1 << BLKDEV_BARRIER)
#define BLKDEV_IFL_SECURE (1 << BLKDEV_SECURE)
extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *,
unsigned long);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 352c486..32703a9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -166,14 +166,6 @@ struct inodes_stat_t {
#define WRITE_FLUSH_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
REQ_FLUSH | REQ_FUA)

-/*
- * These aren't really reads or writes, they pass down information about
- * parts of device that are now unused by the file system.
- */
-#define DISCARD_NOBARRIER (WRITE | REQ_DISCARD)
-#define DISCARD_BARRIER (WRITE | REQ_DISCARD | REQ_HARDBARRIER)
-#define DISCARD_SECURE (DISCARD_NOBARRIER | REQ_SECURE)
-
#define SEL_IN 1
#define SEL_OUT 2
#define SEL_EX 4
--
1.7.1

2010-08-25 15:59:26

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 08/30] block: rename blk-barrier.c to blk-flush.c

Without ordering requirements, barrier and ordered are misnomers.
Rename block/blk-barrier.c to block/blk-flush.c. Renaming of symbols
will follow.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
block/Makefile | 2 +-
block/blk-barrier.c | 248 ---------------------------------------------------
block/blk-flush.c | 248 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 249 insertions(+), 249 deletions(-)
delete mode 100644 block/blk-barrier.c
create mode 100644 block/blk-flush.c

diff --git a/block/Makefile b/block/Makefile
index 0bb499a..f627e4b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -3,7 +3,7 @@
#

obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
- blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
+ blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
deleted file mode 100644
index e8b2e5c..0000000
--- a/block/blk-barrier.c
+++ /dev/null
@@ -1,248 +0,0 @@
-/*
- * Functions related to barrier IO handling
- */
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/bio.h>
-#include <linux/blkdev.h>
-#include <linux/gfp.h>
-
-#include "blk.h"
-
-static struct request *queue_next_ordseq(struct request_queue *q);
-
-/*
- * Cache flushing for ordered writes handling
- */
-unsigned blk_ordered_cur_seq(struct request_queue *q)
-{
- if (!q->ordseq)
- return 0;
- return 1 << ffz(q->ordseq);
-}
-
-static struct request *blk_ordered_complete_seq(struct request_queue *q,
- unsigned seq, int error)
-{
- struct request *next_rq = NULL;
-
- if (error && !q->orderr)
- q->orderr = error;
-
- BUG_ON(q->ordseq & seq);
- q->ordseq |= seq;
-
- if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
- /* not complete yet, queue the next ordered sequence */
- next_rq = queue_next_ordseq(q);
- } else {
- /* complete this barrier request */
- __blk_end_request_all(q->orig_bar_rq, q->orderr);
- q->orig_bar_rq = NULL;
- q->ordseq = 0;
-
- /* dispatch the next barrier if there's one */
- if (!list_empty(&q->pending_barriers)) {
- next_rq = list_entry_rq(q->pending_barriers.next);
- list_move(&next_rq->queuelist, &q->queue_head);
- }
- }
- return next_rq;
-}
-
-static void pre_flush_end_io(struct request *rq, int error)
-{
- elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
-}
-
-static void bar_end_io(struct request *rq, int error)
-{
- elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
-}
-
-static void post_flush_end_io(struct request *rq, int error)
-{
- elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
-}
-
-static void queue_flush(struct request_queue *q, struct request *rq,
- rq_end_io_fn *end_io)
-{
- blk_rq_init(q, rq);
- rq->cmd_type = REQ_TYPE_FS;
- rq->cmd_flags = REQ_FLUSH;
- rq->rq_disk = q->orig_bar_rq->rq_disk;
- rq->end_io = end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-}
-
-static struct request *queue_next_ordseq(struct request_queue *q)
-{
- struct request *rq = &q->bar_rq;
-
- switch (blk_ordered_cur_seq(q)) {
- case QUEUE_ORDSEQ_PREFLUSH:
- queue_flush(q, rq, pre_flush_end_io);
- break;
-
- case QUEUE_ORDSEQ_BAR:
- /* initialize proxy request and queue it */
- blk_rq_init(q, rq);
- init_request_from_bio(rq, q->orig_bar_rq->bio);
- rq->cmd_flags &= ~REQ_HARDBARRIER;
- if (q->ordered & QUEUE_ORDERED_DO_FUA)
- rq->cmd_flags |= REQ_FUA;
- rq->end_io = bar_end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
- break;
-
- case QUEUE_ORDSEQ_POSTFLUSH:
- queue_flush(q, rq, post_flush_end_io);
- break;
-
- default:
- BUG();
- }
- return rq;
-}
-
-struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
-{
- unsigned skip = 0;
-
- if (!(rq->cmd_flags & REQ_HARDBARRIER))
- return rq;
-
- if (q->ordseq) {
- /*
- * Barrier is already in progress and they can't be
- * processed in parallel. Queue for later processing.
- */
- list_move_tail(&rq->queuelist, &q->pending_barriers);
- return NULL;
- }
-
- if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
- /*
- * Queue ordering not supported. Terminate
- * with prejudice.
- */
- blk_dequeue_request(rq);
- __blk_end_request_all(rq, -EOPNOTSUPP);
- return NULL;
- }
-
- /*
- * Start a new ordered sequence
- */
- q->orderr = 0;
- q->ordered = q->next_ordered;
- q->ordseq |= QUEUE_ORDSEQ_STARTED;
-
- /*
- * For an empty barrier, there's no actual BAR request, which
- * in turn makes POSTFLUSH unnecessary. Mask them off.
- */
- if (!blk_rq_sectors(rq))
- q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
- QUEUE_ORDERED_DO_POSTFLUSH);
-
- /* stash away the original request */
- blk_dequeue_request(rq);
- q->orig_bar_rq = rq;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
- skip |= QUEUE_ORDSEQ_PREFLUSH;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
- skip |= QUEUE_ORDSEQ_BAR;
-
- if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
- skip |= QUEUE_ORDSEQ_POSTFLUSH;
-
- /* complete skipped sequences and return the first sequence */
- return blk_ordered_complete_seq(q, skip, 0);
-}
-
-static void bio_end_empty_barrier(struct bio *bio, int err)
-{
- if (err) {
- if (err == -EOPNOTSUPP)
- set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
- clear_bit(BIO_UPTODATE, &bio->bi_flags);
- }
- if (bio->bi_private)
- complete(bio->bi_private);
- bio_put(bio);
-}
-
-/**
- * blkdev_issue_flush - queue a flush
- * @bdev: blockdev to issue flush for
- * @gfp_mask: memory allocation flags (for bio_alloc)
- * @error_sector: error sector
- * @flags: BLKDEV_IFL_* flags to control behaviour
- *
- * Description:
- * Issue a flush for the block device in question. Caller can supply
- * room for storing the error offset in case of a flush error, if they
- * wish to. If WAIT flag is not passed then caller may check only what
- * request was pushed in some internal queue for later handling.
- */
-int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
- sector_t *error_sector, unsigned long flags)
-{
- DECLARE_COMPLETION_ONSTACK(wait);
- struct request_queue *q;
- struct bio *bio;
- int ret = 0;
-
- if (bdev->bd_disk == NULL)
- return -ENXIO;
-
- q = bdev_get_queue(bdev);
- if (!q)
- return -ENXIO;
-
- /*
- * some block devices may not have their queue correctly set up here
- * (e.g. loop device without a backing file) and so issuing a flush
- * here will panic. Ensure there is a request function before issuing
- * the barrier.
- */
- if (!q->make_request_fn)
- return -ENXIO;
-
- bio = bio_alloc(gfp_mask, 0);
- bio->bi_end_io = bio_end_empty_barrier;
- bio->bi_bdev = bdev;
- if (test_bit(BLKDEV_WAIT, &flags))
- bio->bi_private = &wait;
-
- bio_get(bio);
- submit_bio(WRITE_BARRIER, bio);
- if (test_bit(BLKDEV_WAIT, &flags)) {
- wait_for_completion(&wait);
- /*
- * The driver must store the error location in ->bi_sector, if
- * it supports it. For non-stacked drivers, this should be
- * copied from blk_rq_pos(rq).
- */
- if (error_sector)
- *error_sector = bio->bi_sector;
- }
-
- if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
- else if (!bio_flagged(bio, BIO_UPTODATE))
- ret = -EIO;
-
- bio_put(bio);
- return ret;
-}
-EXPORT_SYMBOL(blkdev_issue_flush);
diff --git a/block/blk-flush.c b/block/blk-flush.c
new file mode 100644
index 0000000..e8b2e5c
--- /dev/null
+++ b/block/blk-flush.c
@@ -0,0 +1,248 @@
+/*
+ * Functions related to barrier IO handling
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/gfp.h>
+
+#include "blk.h"
+
+static struct request *queue_next_ordseq(struct request_queue *q);
+
+/*
+ * Cache flushing for ordered writes handling
+ */
+unsigned blk_ordered_cur_seq(struct request_queue *q)
+{
+ if (!q->ordseq)
+ return 0;
+ return 1 << ffz(q->ordseq);
+}
+
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+ unsigned seq, int error)
+{
+ struct request *next_rq = NULL;
+
+ if (error && !q->orderr)
+ q->orderr = error;
+
+ BUG_ON(q->ordseq & seq);
+ q->ordseq |= seq;
+
+ if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+ /* not complete yet, queue the next ordered sequence */
+ next_rq = queue_next_ordseq(q);
+ } else {
+ /* complete this barrier request */
+ __blk_end_request_all(q->orig_bar_rq, q->orderr);
+ q->orig_bar_rq = NULL;
+ q->ordseq = 0;
+
+ /* dispatch the next barrier if there's one */
+ if (!list_empty(&q->pending_barriers)) {
+ next_rq = list_entry_rq(q->pending_barriers.next);
+ list_move(&next_rq->queuelist, &q->queue_head);
+ }
+ }
+ return next_rq;
+}
+
+static void pre_flush_end_io(struct request *rq, int error)
+{
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
+}
+
+static void bar_end_io(struct request *rq, int error)
+{
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
+}
+
+static void post_flush_end_io(struct request *rq, int error)
+{
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
+}
+
+static void queue_flush(struct request_queue *q, struct request *rq,
+ rq_end_io_fn *end_io)
+{
+ blk_rq_init(q, rq);
+ rq->cmd_type = REQ_TYPE_FS;
+ rq->cmd_flags = REQ_FLUSH;
+ rq->rq_disk = q->orig_bar_rq->rq_disk;
+ rq->end_io = end_io;
+
+ elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+}
+
+static struct request *queue_next_ordseq(struct request_queue *q)
+{
+ struct request *rq = &q->bar_rq;
+
+ switch (blk_ordered_cur_seq(q)) {
+ case QUEUE_ORDSEQ_PREFLUSH:
+ queue_flush(q, rq, pre_flush_end_io);
+ break;
+
+ case QUEUE_ORDSEQ_BAR:
+ /* initialize proxy request and queue it */
+ blk_rq_init(q, rq);
+ init_request_from_bio(rq, q->orig_bar_rq->bio);
+ rq->cmd_flags &= ~REQ_HARDBARRIER;
+ if (q->ordered & QUEUE_ORDERED_DO_FUA)
+ rq->cmd_flags |= REQ_FUA;
+ rq->end_io = bar_end_io;
+
+ elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+ break;
+
+ case QUEUE_ORDSEQ_POSTFLUSH:
+ queue_flush(q, rq, post_flush_end_io);
+ break;
+
+ default:
+ BUG();
+ }
+ return rq;
+}
+
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
+{
+ unsigned skip = 0;
+
+ if (!(rq->cmd_flags & REQ_HARDBARRIER))
+ return rq;
+
+ if (q->ordseq) {
+ /*
+ * Barrier is already in progress and they can't be
+ * processed in parallel. Queue for later processing.
+ */
+ list_move_tail(&rq->queuelist, &q->pending_barriers);
+ return NULL;
+ }
+
+ if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
+ /*
+ * Queue ordering not supported. Terminate
+ * with prejudice.
+ */
+ blk_dequeue_request(rq);
+ __blk_end_request_all(rq, -EOPNOTSUPP);
+ return NULL;
+ }
+
+ /*
+ * Start a new ordered sequence
+ */
+ q->orderr = 0;
+ q->ordered = q->next_ordered;
+ q->ordseq |= QUEUE_ORDSEQ_STARTED;
+
+ /*
+ * For an empty barrier, there's no actual BAR request, which
+ * in turn makes POSTFLUSH unnecessary. Mask them off.
+ */
+ if (!blk_rq_sectors(rq))
+ q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
+ QUEUE_ORDERED_DO_POSTFLUSH);
+
+ /* stash away the original request */
+ blk_dequeue_request(rq);
+ q->orig_bar_rq = rq;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+ skip |= QUEUE_ORDSEQ_PREFLUSH;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+ skip |= QUEUE_ORDSEQ_BAR;
+
+ if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+ skip |= QUEUE_ORDSEQ_POSTFLUSH;
+
+ /* complete skipped sequences and return the first sequence */
+ return blk_ordered_complete_seq(q, skip, 0);
+}
+
+static void bio_end_empty_barrier(struct bio *bio, int err)
+{
+ if (err) {
+ if (err == -EOPNOTSUPP)
+ set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ }
+ if (bio->bi_private)
+ complete(bio->bi_private);
+ bio_put(bio);
+}
+
+/**
+ * blkdev_issue_flush - queue a flush
+ * @bdev: blockdev to issue flush for
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ * @error_sector: error sector
+ * @flags: BLKDEV_IFL_* flags to control behaviour
+ *
+ * Description:
+ * Issue a flush for the block device in question. Caller can supply
+ * room for storing the error offset in case of a flush error, if they
+ * wish to. If WAIT flag is not passed then caller may check only what
+ * request was pushed in some internal queue for later handling.
+ */
+int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
+ sector_t *error_sector, unsigned long flags)
+{
+ DECLARE_COMPLETION_ONSTACK(wait);
+ struct request_queue *q;
+ struct bio *bio;
+ int ret = 0;
+
+ if (bdev->bd_disk == NULL)
+ return -ENXIO;
+
+ q = bdev_get_queue(bdev);
+ if (!q)
+ return -ENXIO;
+
+ /*
+ * some block devices may not have their queue correctly set up here
+ * (e.g. loop device without a backing file) and so issuing a flush
+ * here will panic. Ensure there is a request function before issuing
+ * the barrier.
+ */
+ if (!q->make_request_fn)
+ return -ENXIO;
+
+ bio = bio_alloc(gfp_mask, 0);
+ bio->bi_end_io = bio_end_empty_barrier;
+ bio->bi_bdev = bdev;
+ if (test_bit(BLKDEV_WAIT, &flags))
+ bio->bi_private = &wait;
+
+ bio_get(bio);
+ submit_bio(WRITE_BARRIER, bio);
+ if (test_bit(BLKDEV_WAIT, &flags)) {
+ wait_for_completion(&wait);
+ /*
+ * The driver must store the error location in ->bi_sector, if
+ * it supports it. For non-stacked drivers, this should be
+ * copied from blk_rq_pos(rq).
+ */
+ if (error_sector)
+ *error_sector = bio->bi_sector;
+ }
+
+ if (bio_flagged(bio, BIO_EOPNOTSUPP))
+ ret = -EOPNOTSUPP;
+ else if (!bio_flagged(bio, BIO_UPTODATE))
+ ret = -EIO;
+
+ bio_put(bio);
+ return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_flush);
--
1.7.1

2010-08-25 15:59:30

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 13/30] block: simplify queue_next_fseq

From: Christoph Hellwig <[email protected]>

We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
so move these calls into common code. Also move the end_io initialization
from queue_flush into queue_next_fseq and rename queue_flush to
init_flush_request now that its old name no longer applies.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-flush.c | 26 ++++++++++----------------
1 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index ab765c2..4e96e18 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,16 +74,11 @@ static void post_flush_end_io(struct request *rq, int error)
blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
}

-static void queue_flush(struct request_queue *q, struct request *rq,
- rq_end_io_fn *end_io)
+static void init_flush_request(struct request *rq, struct gendisk *disk)
{
- blk_rq_init(q, rq);
rq->cmd_type = REQ_TYPE_FS;
rq->cmd_flags = REQ_FLUSH;
- rq->rq_disk = q->orig_flush_rq->rq_disk;
- rq->end_io = end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+ rq->rq_disk = disk;
}

static struct request *queue_next_fseq(struct request_queue *q)
@@ -91,29 +86,28 @@ static struct request *queue_next_fseq(struct request_queue *q)
struct request *orig_rq = q->orig_flush_rq;
struct request *rq = &q->flush_rq;

+ blk_rq_init(q, rq);
+
switch (blk_flush_cur_seq(q)) {
case QUEUE_FSEQ_PREFLUSH:
- queue_flush(q, rq, pre_flush_end_io);
+ init_flush_request(rq, orig_rq->rq_disk);
+ rq->end_io = pre_flush_end_io;
break;
-
case QUEUE_FSEQ_DATA:
- /* initialize proxy request, inherit FLUSH/FUA and queue it */
- blk_rq_init(q, rq);
init_request_from_bio(rq, orig_rq->bio);
rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
rq->end_io = flush_data_end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
break;
-
case QUEUE_FSEQ_POSTFLUSH:
- queue_flush(q, rq, post_flush_end_io);
+ init_flush_request(rq, orig_rq->rq_disk);
+ rq->end_io = post_flush_end_io;
break;
-
default:
BUG();
}
+
+ elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
return rq;
}

--
1.7.1

2010-08-25 15:59:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/30] block: rename barrier/ordered to flush

With ordering requirements dropped, barrier and ordered are misnomers.
Now all the block layer does is sequence FLUSH and FUA. Rename them to
flush.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
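As a rough sketch of the resulting interface (blk_queue_flush() as used
elsewhere in this series; illustrative only, not part of this patch), a
driver with a volatile write cache advertises what it supports and the
block layer sequences the rest:

  /* sketch: advertise cache flush capabilities */
  blk_queue_flush(q, REQ_FLUSH | REQ_FUA);  /* flush + FUA writes */
  blk_queue_flush(q, REQ_FLUSH);            /* flush only; FUA via post-flush */

The flush machinery then walks QUEUE_FSEQ_PREFLUSH -> QUEUE_FSEQ_DATA ->
QUEUE_FSEQ_POSTFLUSH, skipping whichever steps the device makes unnecessary.
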
block/blk-core.c | 21 +++++-----
block/blk-flush.c | 98 +++++++++++++++++++++++------------------------
block/blk.h | 4 +-
include/linux/blkdev.h | 24 ++++++------
4 files changed, 72 insertions(+), 75 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d316662..8870ae4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -136,7 +136,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
{
struct request_queue *q = rq->q;

- if (&q->bar_rq != rq) {
+ if (&q->flush_rq != rq) {
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -160,13 +160,12 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
if (bio->bi_size == 0)
bio_endio(bio, error);
} else {
-
/*
- * Okay, this is the barrier request in progress, just
- * record the error;
+ * Okay, this is the sequenced flush request in
+ * progress, just record the error;
*/
- if (error && !q->orderr)
- q->orderr = error;
+ if (error && !q->flush_err)
+ q->flush_err = error;
}
}

@@ -520,7 +519,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
init_timer(&q->unplug_timer);
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
INIT_LIST_HEAD(&q->timeout_list);
- INIT_LIST_HEAD(&q->pending_barriers);
+ INIT_LIST_HEAD(&q->pending_flushes);
INIT_WORK(&q->unplug_work, blk_unplug_work);

kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1764,11 +1763,11 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
static void blk_account_io_done(struct request *req)
{
/*
- * Account IO completion. bar_rq isn't accounted as a normal
- * IO on queueing nor completion. Accounting the containing
- * request is enough.
+ * Account IO completion. flush_rq isn't accounted as a
+ * normal IO on queueing nor completion. Accounting the
+ * containing request is enough.
*/
- if (blk_do_io_stat(req) && req != &req->q->bar_rq) {
+ if (blk_do_io_stat(req) && req != &req->q->flush_rq) {
unsigned long duration = jiffies - req->start_time;
const int rw = rq_data_dir(req);
struct hd_struct *part;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index e8b2e5c..dd87322 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -9,41 +9,38 @@

#include "blk.h"

-static struct request *queue_next_ordseq(struct request_queue *q);
+static struct request *queue_next_fseq(struct request_queue *q);

-/*
- * Cache flushing for ordered writes handling
- */
-unsigned blk_ordered_cur_seq(struct request_queue *q)
+unsigned blk_flush_cur_seq(struct request_queue *q)
{
- if (!q->ordseq)
+ if (!q->flush_seq)
return 0;
- return 1 << ffz(q->ordseq);
+ return 1 << ffz(q->flush_seq);
}

-static struct request *blk_ordered_complete_seq(struct request_queue *q,
- unsigned seq, int error)
+static struct request *blk_flush_complete_seq(struct request_queue *q,
+ unsigned seq, int error)
{
struct request *next_rq = NULL;

- if (error && !q->orderr)
- q->orderr = error;
+ if (error && !q->flush_err)
+ q->flush_err = error;

- BUG_ON(q->ordseq & seq);
- q->ordseq |= seq;
+ BUG_ON(q->flush_seq & seq);
+ q->flush_seq |= seq;

- if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
- /* not complete yet, queue the next ordered sequence */
- next_rq = queue_next_ordseq(q);
+ if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
+ /* not complete yet, queue the next flush sequence */
+ next_rq = queue_next_fseq(q);
} else {
- /* complete this barrier request */
- __blk_end_request_all(q->orig_bar_rq, q->orderr);
- q->orig_bar_rq = NULL;
- q->ordseq = 0;
-
- /* dispatch the next barrier if there's one */
- if (!list_empty(&q->pending_barriers)) {
- next_rq = list_entry_rq(q->pending_barriers.next);
+ /* complete this flush request */
+ __blk_end_request_all(q->orig_flush_rq, q->flush_err);
+ q->orig_flush_rq = NULL;
+ q->flush_seq = 0;
+
+ /* dispatch the next flush if there's one */
+ if (!list_empty(&q->pending_flushes)) {
+ next_rq = list_entry_rq(q->pending_flushes.next);
list_move(&next_rq->queuelist, &q->queue_head);
}
}
@@ -53,19 +50,19 @@ static struct request *blk_ordered_complete_seq(struct request_queue *q,
static void pre_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
}

-static void bar_end_io(struct request *rq, int error)
+static void flush_data_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
}

static void post_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
}

static void queue_flush(struct request_queue *q, struct request *rq,
@@ -74,34 +71,34 @@ static void queue_flush(struct request_queue *q, struct request *rq,
blk_rq_init(q, rq);
rq->cmd_type = REQ_TYPE_FS;
rq->cmd_flags = REQ_FLUSH;
- rq->rq_disk = q->orig_bar_rq->rq_disk;
+ rq->rq_disk = q->orig_flush_rq->rq_disk;
rq->end_io = end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
}

-static struct request *queue_next_ordseq(struct request_queue *q)
+static struct request *queue_next_fseq(struct request_queue *q)
{
- struct request *rq = &q->bar_rq;
+ struct request *rq = &q->flush_rq;

- switch (blk_ordered_cur_seq(q)) {
- case QUEUE_ORDSEQ_PREFLUSH:
+ switch (blk_flush_cur_seq(q)) {
+ case QUEUE_FSEQ_PREFLUSH:
queue_flush(q, rq, pre_flush_end_io);
break;

- case QUEUE_ORDSEQ_BAR:
+ case QUEUE_FSEQ_DATA:
/* initialize proxy request and queue it */
blk_rq_init(q, rq);
- init_request_from_bio(rq, q->orig_bar_rq->bio);
+ init_request_from_bio(rq, q->orig_flush_rq->bio);
rq->cmd_flags &= ~REQ_HARDBARRIER;
if (q->ordered & QUEUE_ORDERED_DO_FUA)
rq->cmd_flags |= REQ_FUA;
- rq->end_io = bar_end_io;
+ rq->end_io = flush_data_end_io;

elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
break;

- case QUEUE_ORDSEQ_POSTFLUSH:
+ case QUEUE_FSEQ_POSTFLUSH:
queue_flush(q, rq, post_flush_end_io);
break;

@@ -111,19 +108,20 @@ static struct request *queue_next_ordseq(struct request_queue *q)
return rq;
}

-struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
+struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned skip = 0;

if (!(rq->cmd_flags & REQ_HARDBARRIER))
return rq;

- if (q->ordseq) {
+ if (q->flush_seq) {
/*
- * Barrier is already in progress and they can't be
- * processed in parallel. Queue for later processing.
+ * Sequenced flush is already in progress and they
+ * can't be processed in parallel. Queue for later
+ * processing.
*/
- list_move_tail(&rq->queuelist, &q->pending_barriers);
+ list_move_tail(&rq->queuelist, &q->pending_flushes);
return NULL;
}

@@ -138,11 +136,11 @@ struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
}

/*
- * Start a new ordered sequence
+ * Start a new flush sequence
*/
- q->orderr = 0;
+ q->flush_err = 0;
q->ordered = q->next_ordered;
- q->ordseq |= QUEUE_ORDSEQ_STARTED;
+ q->flush_seq |= QUEUE_FSEQ_STARTED;

/*
* For an empty barrier, there's no actual BAR request, which
@@ -154,19 +152,19 @@ struct request *blk_do_ordered(struct request_queue *q, struct request *rq)

/* stash away the original request */
blk_dequeue_request(rq);
- q->orig_bar_rq = rq;
+ q->orig_flush_rq = rq;

if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
- skip |= QUEUE_ORDSEQ_PREFLUSH;
+ skip |= QUEUE_FSEQ_PREFLUSH;

if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
- skip |= QUEUE_ORDSEQ_BAR;
+ skip |= QUEUE_FSEQ_DATA;

if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
- skip |= QUEUE_ORDSEQ_POSTFLUSH;
+ skip |= QUEUE_FSEQ_POSTFLUSH;

/* complete skipped sequences and return the first sequence */
- return blk_ordered_complete_seq(q, skip, 0);
+ return blk_flush_complete_seq(q, skip, 0);
}

static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk.h b/block/blk.h
index 08081e4..24b92bd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,7 +51,7 @@ static inline void blk_clear_rq_complete(struct request *rq)
*/
#define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash))

-struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+struct request *blk_do_flush(struct request_queue *q, struct request *rq);

static inline struct request *__elv_next_request(struct request_queue *q)
{
@@ -60,7 +60,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
while (1) {
while (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- rq = blk_do_ordered(q, rq);
+ rq = blk_do_flush(q, rq);
if (rq)
return rq;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 20a3710..1cd83ec 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -357,13 +357,13 @@ struct request_queue
/*
* for flush operations
*/
+ unsigned int ordered, next_ordered;
unsigned int flush_flags;
-
- unsigned int ordered, next_ordered, ordseq;
- int orderr;
- struct request bar_rq;
- struct request *orig_bar_rq;
- struct list_head pending_barriers;
+ unsigned int flush_seq;
+ int flush_err;
+ struct request flush_rq;
+ struct request *orig_flush_rq;
+ struct list_head pending_flushes;

struct mutex sysfs_lock;

@@ -490,13 +490,13 @@ enum {
QUEUE_ORDERED_DO_FUA,

/*
- * Ordered operation sequence
+ * FLUSH/FUA sequences.
*/
- QUEUE_ORDSEQ_STARTED = (1 << 0), /* flushing in progress */
- QUEUE_ORDSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
- QUEUE_ORDSEQ_BAR = (1 << 2), /* barrier write in progress */
- QUEUE_ORDSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
- QUEUE_ORDSEQ_DONE = (1 << 4),
+ QUEUE_FSEQ_STARTED = (1 << 0), /* flushing in progress */
+ QUEUE_FSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */
+ QUEUE_FSEQ_DATA = (1 << 2), /* data write in progress */
+ QUEUE_FSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */
+ QUEUE_FSEQ_DONE = (1 << 4),
};

#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
--
1.7.1

2010-08-25 15:59:43

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 30/30] block: remove the BH_Eopnotsupp flag

From: Christoph Hellwig <[email protected]>

This flag was only set for barrier buffers, which we don't submit
anymore.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
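A sketch of the simplified wait path once the flag is gone (this is the
pattern the hunks below converge on):

  /* sketch: without BH_Eopnotsupp, a failed sync is plain -EIO */
  wait_on_buffer(bh);
  if (!buffer_uptodate(bh))
          ret = -EIO;
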
fs/buffer.c | 7 +------
fs/fat/misc.c | 5 +----
include/linux/buffer_head.h | 2 --
3 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..7f0b9b0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -156,7 +156,7 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
if (uptodate) {
set_buffer_uptodate(bh);
} else {
- if (!buffer_eopnotsupp(bh) && !quiet_error(bh)) {
+ if (!quiet_error(bh)) {
buffer_io_error(bh);
printk(KERN_WARNING "lost page write due to "
"I/O error on %s\n",
@@ -2891,7 +2891,6 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)

if (err == -EOPNOTSUPP) {
set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
- set_bit(BH_Eopnotsupp, &bh->b_state);
}

if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags)))
@@ -3031,10 +3030,6 @@ int __sync_dirty_buffer(struct buffer_head *bh, int rw)
bh->b_end_io = end_buffer_write_sync;
ret = submit_bh(rw, bh);
wait_on_buffer(bh);
- if (buffer_eopnotsupp(bh)) {
- clear_buffer_eopnotsupp(bh);
- ret = -EOPNOTSUPP;
- }
if (!ret && !buffer_uptodate(bh))
ret = -EIO;
} else {
diff --git a/fs/fat/misc.c b/fs/fat/misc.c
index 1736f23..970e682 100644
--- a/fs/fat/misc.c
+++ b/fs/fat/misc.c
@@ -255,10 +255,7 @@ int fat_sync_bhs(struct buffer_head **bhs, int nr_bhs)

for (i = 0; i < nr_bhs; i++) {
wait_on_buffer(bhs[i]);
- if (buffer_eopnotsupp(bhs[i])) {
- clear_buffer_eopnotsupp(bhs[i]);
- err = -EOPNOTSUPP;
- } else if (!err && !buffer_uptodate(bhs[i]))
+ if (!err && !buffer_uptodate(bhs[i]))
err = -EIO;
}
return err;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index fc999f5..dd1b25b 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -32,7 +32,6 @@ enum bh_state_bits {
BH_Delay, /* Buffer is not yet allocated on disk */
BH_Boundary, /* Block is followed by a discontiguity */
BH_Write_EIO, /* I/O error on write */
- BH_Eopnotsupp, /* DEPRECATED: operation not supported (barrier) */
BH_Unwritten, /* Buffer is allocated on disk but not written */
BH_Quiet, /* Buffer Error Prinks to be quiet */

@@ -124,7 +123,6 @@ BUFFER_FNS(Async_Write, async_write)
BUFFER_FNS(Delay, delay)
BUFFER_FNS(Boundary, boundary)
BUFFER_FNS(Write_EIO, write_io_error)
-BUFFER_FNS(Eopnotsupp, eopnotsupp)
BUFFER_FNS(Unwritten, unwritten)

#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
--
1.7.1

2010-08-25 15:59:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 27/30] fat: do not send discards as barriers

From: Christoph Hellwig <[email protected]>

fat already uses synchronous discards, no need to add I/O barriers.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/fat/fatent.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
index 3a56a82..f9a0b7a 100644
--- a/fs/fat/fatent.c
+++ b/fs/fat/fatent.c
@@ -579,7 +579,7 @@ int fat_free_clusters(struct inode *inode, int cluster)
fat_clus_to_blknr(sbi, first_cl),
nr_clus * sbi->sec_per_clus,
GFP_NOFS,
- BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+ BLKDEV_IFL_WAIT);

first_cl = cluster;
}
--
1.7.1

2010-08-25 16:00:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 29/30] block: remove the BLKDEV_IFL_BARRIER flag

On Wed, Aug 25, 2010 at 05:47:46PM +0200, Tejun Heo wrote:
> From: Christoph Hellwig <[email protected]>
>
> Remove support for barriers on discards, which is unused now. Also
> remove the DISCARD_NOBARRIER I/O type in favour of just setting the
> rw flags up locally in blkdev_issue_discard.
>
> tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.

REQ_SECURE was just added for mmc. I assume they plan to submit the
driver support soon.

2010-08-25 16:00:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
> From: Christoph Hellwig <[email protected]>
>
> ext4 already uses synchronous discards, no need to add I/O barriers.

This needs the patch that Jan sent in reply to my initial version merged
into it.

2010-08-25 16:01:13

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
> > From: Christoph Hellwig <[email protected]>
> >
> > ext4 already uses synchronous discards, no need to add I/O barriers.
>
> This needs the patch that Jan sent in reply to my initial version merged
> into it.

Actually the jbd2 patch needs it merged, but the point still stands.

2010-08-25 15:59:15

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 01/30] ide: remove unnecessary blk_queue_flushing() test in do_ide_request()

Unplugging from a request function doesn't really help much (we're
already in the request_fn) and soon the block layer will be updated to
mix the barrier sequence with other commands, so there's no need to
treat queue flushing any differently.

ide was the only user of blk_queue_flushing(). Remove it.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Acked-by: David S. Miller <[email protected]>
---
drivers/ide/ide-io.c | 13 -------------
include/linux/blkdev.h | 1 -
2 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index a381be8..999dac0 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -441,19 +441,6 @@ void do_ide_request(struct request_queue *q)
struct request *rq = NULL;
ide_startstop_t startstop;

- /*
- * drive is doing pre-flush, ordered write, post-flush sequence. even
- * though that is 3 requests, it must be seen as a single transaction.
- * we must not preempt this drive until that is complete
- */
- if (blk_queue_flushing(q))
- /*
- * small race where queue could get replugged during
- * the 3-request flush cycle, just yank the plug since
- * we want it to finish asap
- */
- blk_remove_plug(q);
-
spin_unlock_irq(q->queue_lock);

/* HLD do_request() callback might sleep, make sure it's okay */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2c54906..015375c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -521,7 +521,6 @@ enum {
#define blk_queue_nonrot(q) test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
-#define blk_queue_flushing(q) ((q)->ordseq)
#define blk_queue_stackable(q) \
test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)
#define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
--
1.7.1

2010-08-25 15:58:55

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 15/30] virtio_blk: drop REQ_HARDBARRIER support

Remove now-unused REQ_HARDBARRIER support. virtio_blk already
supports REQ_FLUSH and the usefulness of REQ_FUA for virtio_blk is
questionable at this point, so there's nothing else to do to support
the new REQ_FLUSH/FUA interface.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
drivers/block/virtio_blk.c | 17 ++++-------------
1 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index d10b635..1260628 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -128,9 +128,6 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
}
}

- if (vbr->req->cmd_flags & REQ_HARDBARRIER)
- vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-
sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));

/*
@@ -388,13 +385,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that to
- * implement write barrier support; otherwise, we must assume
- * that the host does not perform any kind of volatile write
- * caching.
- */
+ /* configure queue flush support */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
blk_queue_flush(q, REQ_FLUSH);

@@ -515,9 +506,9 @@ static const struct virtio_device_id id_table[] = {
};

static unsigned int features[] = {
- VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
- VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
- VIRTIO_BLK_F_SCSI, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
+ VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+ VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
+ VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
};

/*
--
1.7.1

2010-08-25 16:03:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On 08/25/2010 06:00 PM, Christoph Hellwig wrote:
> On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
>> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
>>> From: Christoph Hellwig <[email protected]>
>>>
>>> ext4 already uses synchronous discards, no need to add I/O barriers.
>>
>> This needs the patch that Jan sent in reply to my initial version merged
>> into it.
>
> Actually the jbd2 patch needs it merged, but the point still stands.

Yeah, wasn't sure about that one. Has anyone tested it? I'll be
happy to merge it but I have no idea whether it's correct or not and
Jan didn't seem to have tested it so... Jan, shall I merge the patch?

Thanks.

--
tejun

2010-08-25 16:03:38

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.36-rc2] block, fs: replace HARDBARRIER with FLUSH/FUA

On Wed, Aug 25 2010 at 11:47am -0400,
Tejun Heo <[email protected]> wrote:

> Hello,
>
> This patchset is combination of the following three patchset.
>
> [1] block: replace barrier with sequenced flush
> [2] block: convert to REQ_FLUSH/FUA
> [3] replace barriers with explicit flush / FUA usage
>
> Changes from the previous postings are,
>
> * Rebased on top of 2.6.36-rc2 (502adf5778f4151dcba3f64dd6ed322151f3712c)

Awesome, thanks!

> * dm conversion is excluded for now.
...
> I've audited all make_request drivers and after this patchset only
> blktrace, dm, drbd and xen need more work. I'll work on blktrace and
> dm

OK, once you have the DM patch refreshed I'll jump in with further
changes (SCSI error differentiation in mpath) and more careful review of
the DM changes (both bio-based and request-based).

Thanks again!
Mike

2010-08-25 15:59:01

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 12/30] block: use REQ_FLUSH in blkdev_issue_flush()

Update blkdev_issue_flush() to use the new REQ_FLUSH interface.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
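Caller-side, nothing changes; a minimal sketch using the signature from
this series (the wrapper name is invented for illustration):

  #include <linux/blkdev.h>

  /* sketch: flush the device's volatile write cache and wait */
  static int flush_dev_cache(struct block_device *bdev)
  {
          return blkdev_issue_flush(bdev, GFP_KERNEL, NULL,
                                    BLKDEV_IFL_WAIT);
  }
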
block/blk-flush.c | 17 ++++++-----------
1 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 452c552..ab765c2 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -174,13 +174,10 @@ struct request *blk_do_flush(struct request_queue *q, struct request *rq)
return blk_flush_complete_seq(q, skip, 0);
}

-static void bio_end_empty_barrier(struct bio *bio, int err)
+static void bio_end_flush(struct bio *bio, int err)
{
- if (err) {
- if (err == -EOPNOTSUPP)
- set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+ if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
- }
if (bio->bi_private)
complete(bio->bi_private);
bio_put(bio);
@@ -218,19 +215,19 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
* some block devices may not have their queue correctly set up here
* (e.g. loop device without a backing file) and so issuing a flush
* here will panic. Ensure there is a request function before issuing
- * the barrier.
+ * the flush.
*/
if (!q->make_request_fn)
return -ENXIO;

bio = bio_alloc(gfp_mask, 0);
- bio->bi_end_io = bio_end_empty_barrier;
+ bio->bi_end_io = bio_end_flush;
bio->bi_bdev = bdev;
if (test_bit(BLKDEV_WAIT, &flags))
bio->bi_private = &wait;

bio_get(bio);
- submit_bio(WRITE_BARRIER, bio);
+ submit_bio(WRITE_FLUSH, bio);
if (test_bit(BLKDEV_WAIT, &flags)) {
wait_for_completion(&wait);
/*
@@ -242,9 +239,7 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
*error_sector = bio->bi_sector;
}

- if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
- else if (!bio_flagged(bio, BIO_UPTODATE))
+ if (!bio_flagged(bio, BIO_UPTODATE))
ret = -EIO;

bio_put(bio);
--
1.7.1

2010-08-25 15:53:58

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 21/30] gfs2: replace barriers with explicit flush / FUA usage

From: Christoph Hellwig <[email protected]>

Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Steven Whitehouse <[email protected]>
Acked-by: Bob Peterson <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/gfs2/log.c | 19 +++++--------------
fs/gfs2/rgrp.c | 5 ++---
2 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index cde1248..9c65170 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -592,22 +592,13 @@ static void log_write_header(struct gfs2_sbd *sdp, u32 flags, int pull)
lh->lh_hash = cpu_to_be32(hash);

bh->b_end_io = end_buffer_write_sync;
- if (test_bit(SDF_NOBARRIERS, &sdp->sd_flags))
- goto skip_barrier;
get_bh(bh);
- submit_bh(WRITE_BARRIER | REQ_META, bh);
- wait_on_buffer(bh);
- if (buffer_eopnotsupp(bh)) {
- clear_buffer_eopnotsupp(bh);
- set_buffer_uptodate(bh);
- fs_info(sdp, "barrier sync failed - disabling barriers\n");
- set_bit(SDF_NOBARRIERS, &sdp->sd_flags);
- lock_buffer(bh);
-skip_barrier:
- get_bh(bh);
+ if (test_bit(SDF_NOBARRIERS, &sdp->sd_flags))
submit_bh(WRITE_SYNC | REQ_META, bh);
- wait_on_buffer(bh);
- }
+ else
+ submit_bh(WRITE_FLUSH_FUA | REQ_META, bh);
+ wait_on_buffer(bh);
+
if (!buffer_uptodate(bh))
gfs2_io_error_bh(sdp, bh);
brelse(bh);
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 171a744..3793164 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -854,8 +854,7 @@ static void gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
if ((start + nr_sects) != blk) {
rv = blkdev_issue_discard(bdev, start,
nr_sects, GFP_NOFS,
- BLKDEV_IFL_WAIT |
- BLKDEV_IFL_BARRIER);
+ BLKDEV_IFL_WAIT);
if (rv)
goto fail;
nr_sects = 0;
@@ -870,7 +869,7 @@ start_new_extent:
}
if (nr_sects) {
rv = blkdev_issue_discard(bdev, start, nr_sects, GFP_NOFS,
- BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+ BLKDEV_IFL_WAIT);
if (rv)
goto fail;
}
--
1.7.1

2010-08-25 15:54:05

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 25/30] jbd2: replace barriers with explicit flush / FUA usage

From: Christoph Hellwig <[email protected]>

Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the
EOPNOTSUPP detection for barriers.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Jan Kara <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
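For context, WRITE_FLUSH_FUA as defined earlier in this series
(include/linux/fs.h) bundles the preflush and the FUA write into a single
submission, which is what lets the fallback code below go away. A sketch
of the resulting commit-record path:

  #define WRITE_FLUSH_FUA	(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
  				 REQ_FLUSH | REQ_FUA)

  /* sketch: commit record submission; no -EOPNOTSUPP retry needed */
  ret = submit_bh(WRITE_SYNC_PLUG | WRITE_FLUSH_FUA, bh);
  wait_on_buffer(bh);
  if (!buffer_uptodate(bh))
          ret = -EIO;
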
fs/jbd2/commit.c | 43 ++++---------------------------------------
1 files changed, 4 insertions(+), 39 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 7c068c1..db99ecb 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -134,25 +134,11 @@ static int journal_submit_commit_record(journal_t *journal,

if (journal->j_flags & JBD2_BARRIER &&
!JBD2_HAS_INCOMPAT_FEATURE(journal,
- JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
- ret = submit_bh(WRITE_SYNC_PLUG | WRITE_BARRIER, bh);
- if (ret == -EOPNOTSUPP) {
- printk(KERN_WARNING
- "JBD2: Disabling barriers on %s, "
- "not supported by device\n", journal->j_devname);
- write_lock(&journal->j_state_lock);
- journal->j_flags &= ~JBD2_BARRIER;
- write_unlock(&journal->j_state_lock);
-
- /* And try again, without the barrier */
- lock_buffer(bh);
- set_buffer_uptodate(bh);
- clear_buffer_dirty(bh);
- ret = submit_bh(WRITE_SYNC_PLUG, bh);
- }
- } else {
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
+ ret = submit_bh(WRITE_SYNC_PLUG | WRITE_FLUSH_FUA, bh);
+ else
ret = submit_bh(WRITE_SYNC_PLUG, bh);
- }
+
*cbh = bh;
return ret;
}
@@ -166,29 +152,8 @@ static int journal_wait_on_commit_record(journal_t *journal,
{
int ret = 0;

-retry:
clear_buffer_dirty(bh);
wait_on_buffer(bh);
- if (buffer_eopnotsupp(bh) && (journal->j_flags & JBD2_BARRIER)) {
- printk(KERN_WARNING
- "JBD2: %s: disabling barries on %s - not supported "
- "by device\n", __func__, journal->j_devname);
- write_lock(&journal->j_state_lock);
- journal->j_flags &= ~JBD2_BARRIER;
- write_unlock(&journal->j_state_lock);
-
- lock_buffer(bh);
- clear_buffer_dirty(bh);
- set_buffer_uptodate(bh);
- bh->b_end_io = journal_end_buffer_io_sync;
-
- ret = submit_bh(WRITE_SYNC_PLUG, bh);
- if (ret) {
- unlock_buffer(bh);
- return ret;
- }
- goto retry;
- }

if (unlikely(!buffer_uptodate(bh)))
ret = -EIO;
--
1.7.1

2010-08-25 16:04:57

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 20/30] btrfs: replace barriers with explicit flush / FUA usage

From: Christoph Hellwig <[email protected]>

Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Chris Mason <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/btrfs/disk-io.c | 19 ++++---------------
fs/btrfs/extent-tree.c | 2 +-
fs/btrfs/volumes.c | 4 ----
fs/btrfs/volumes.h | 1 -
4 files changed, 5 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 64f1008..5e789f4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2063,7 +2063,7 @@ static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
if (uptodate) {
set_buffer_uptodate(bh);
} else {
- if (!buffer_eopnotsupp(bh) && printk_ratelimit()) {
+ if (printk_ratelimit()) {
printk(KERN_WARNING "lost page write due to "
"I/O error on %s\n",
bdevname(bh->b_bdev, b));
@@ -2200,21 +2200,10 @@ static int write_dev_supers(struct btrfs_device *device,
bh->b_end_io = btrfs_end_buffer_write_sync;
}

- if (i == last_barrier && do_barriers && device->barriers) {
- ret = submit_bh(WRITE_BARRIER, bh);
- if (ret == -EOPNOTSUPP) {
- printk("btrfs: disabling barriers on dev %s\n",
- device->name);
- set_buffer_uptodate(bh);
- device->barriers = 0;
- /* one reference for submit_bh */
- get_bh(bh);
- lock_buffer(bh);
- ret = submit_bh(WRITE_SYNC, bh);
- }
- } else {
+ if (i == last_barrier && do_barriers)
+ ret = submit_bh(WRITE_FLUSH_FUA, bh);
+ else
ret = submit_bh(WRITE_SYNC, bh);
- }

if (ret)
errors++;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 32d0940..43dc9ea 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1696,7 +1696,7 @@ static void btrfs_issue_discard(struct block_device *bdev,
u64 start, u64 len)
{
blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
- BLKDEV_IFL_WAIT | BLKDEV_IFL_BARRIER);
+ BLKDEV_IFL_WAIT);
}

static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd318ff..e25e46a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -398,7 +398,6 @@ static noinline int device_list_add(const char *path,
device->work.func = pending_bios_fn;
memcpy(device->uuid, disk_super->dev_item.uuid,
BTRFS_UUID_SIZE);
- device->barriers = 1;
spin_lock_init(&device->io_lock);
device->name = kstrdup(path, GFP_NOFS);
if (!device->name) {
@@ -462,7 +461,6 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
device->devid = orig_dev->devid;
device->work.func = pending_bios_fn;
memcpy(device->uuid, orig_dev->uuid, sizeof(device->uuid));
- device->barriers = 1;
spin_lock_init(&device->io_lock);
INIT_LIST_HEAD(&device->dev_list);
INIT_LIST_HEAD(&device->dev_alloc_list);
@@ -1489,7 +1487,6 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
trans = btrfs_start_transaction(root, 0);
lock_chunks(root);

- device->barriers = 1;
device->writeable = 1;
device->work.func = pending_bios_fn;
generate_random_uuid(device->uuid);
@@ -3084,7 +3081,6 @@ static struct btrfs_device *add_missing_dev(struct btrfs_root *root,
return NULL;
list_add(&device->dev_list,
&fs_devices->devices);
- device->barriers = 1;
device->dev_root = root->fs_info->dev_root;
device->devid = devid;
device->work.func = pending_bios_fn;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 31b0fab..2b638b6 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -42,7 +42,6 @@ struct btrfs_device {
int running_pending;
u64 generation;

- int barriers;
int writeable;
int in_fs_metadata;

--
1.7.1

2010-08-25 15:54:00

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 24/30] jbd: replace barriers with explicit flush / FUA usage

From: Christoph Hellwig <[email protected]>

Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the
EOPNOTSUPP detection for barriers.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Jan Kara <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/jbd/commit.c | 30 +++---------------------------
1 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 95d8c11..484c5e5 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -137,34 +137,10 @@ static int journal_write_commit_record(journal_t *journal,
JBUFFER_TRACE(descriptor, "write commit block");
set_buffer_dirty(bh);

- if (journal->j_flags & JFS_BARRIER) {
- ret = __sync_dirty_buffer(bh, WRITE_SYNC | WRITE_BARRIER);
-
- /*
- * Is it possible for another commit to fail at roughly
- * the same time as this one? If so, we don't want to
- * trust the barrier flag in the super, but instead want
- * to remember if we sent a barrier request
- */
- if (ret == -EOPNOTSUPP) {
- char b[BDEVNAME_SIZE];
-
- printk(KERN_WARNING
- "JBD: barrier-based sync failed on %s - "
- "disabling barriers\n",
- bdevname(journal->j_dev, b));
- spin_lock(&journal->j_state_lock);
- journal->j_flags &= ~JFS_BARRIER;
- spin_unlock(&journal->j_state_lock);
-
- /* And try again, without the barrier */
- set_buffer_uptodate(bh);
- set_buffer_dirty(bh);
- ret = sync_dirty_buffer(bh);
- }
- } else {
+ if (journal->j_flags & JFS_BARRIER)
+ ret = __sync_dirty_buffer(bh, WRITE_SYNC | WRITE_FLUSH_FUA);
+ else
ret = sync_dirty_buffer(bh);
- }

put_bh(bh); /* One for getblk() */
journal_put_journal_head(descriptor);
--
1.7.1

2010-08-25 15:58:50

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/30] block/loop: queue ordered mode should be DRAIN_FLUSH

loop implements FLUSH using fsync but was incorrectly setting its
ordered mode to DRAIN. Change it to DRAIN_FLUSH. In practice, this
doesn't change anything as loop doesn't make use of the block layer
ordered implementation.

Signed-off-by: Tejun Heo <[email protected]>
---
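A hedged sketch of the FLUSH-by-fsync behaviour referred to above (the
helper shape is assumed for illustration, not lifted from loop.c):

  /* sketch: loop services a cache flush by fsyncing the backing file */
  static int lo_flush_sketch(struct loop_device *lo)
  {
          return vfs_fsync(lo->lo_backing_file, 0);
  }
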
drivers/block/loop.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f3c636d..c3a4a2e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
lo->lo_queue->unplug_fn = loop_unplug;

if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
- blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN);
+ blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);

set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
--
1.7.1
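
To make the DRAIN_FLUSH distinction concrete, the pattern loop relies on
looks roughly like the sketch below. This is an illustration, not the
actual loop.c code: the helper name loop_flush_backing() is made up, and
the use of vfs_fsync() on the backing file is an assumption based on the
description above (vfs_fsync() takes two arguments on 2.6.35+ kernels).

    /* sketch: a file-backed driver servicing FLUSH via fsync */
    static int loop_flush_backing(struct loop_device *lo)
    {
            /* push the backing file's dirty pages and ask the
             * underlying device to flush its volatile cache */
            int ret = vfs_fsync(lo->lo_backing_file, 0);
            return ret ? -EIO : 0;
    }

    /* the queue must advertise that a flush step is part of the
     * ordered sequence, hence DRAIN_FLUSH rather than plain DRAIN */
    blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);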

2010-08-25 20:03:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On Wed 25-08-10 17:57:41, Tejun Heo wrote:
> On 08/25/2010 06:00 PM, Christoph Hellwig wrote:
> > On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
> >> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
> >>> From: Christoph Hellwig <[email protected]>
> >>>
> >>> ext4 already uses synchronous discards, no need to add I/O barriers.
> >>
> >> This needs the patch that Jan sent in reply to my initial version merged
> >> into it.
> >
> > Actually the jbd2 patch needs it merged, but the point still stands.
>
> Yeah, wasn't sure about that one. Has anyone tested it? I'll be
> happy to merge it but I have no idea whether it's correct or not and
> Jan didn't seem to have tested it so... Jan, shall I merge the patch?
I'm quite confident the patch is correct, so I think you can merge it, but
tomorrow I'll give it some crash testing in KVM together with the rest of
your patch set, to be sure.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-08-26 08:28:59

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 24.5/30] jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier

From 49f4cef00a1bd3c79fb2fe1f982c5157f0792867 Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>

Currently JBD2 relies on blkdev_issue_flush() draining the queue when the
ASYNC_COMMIT feature is set. This property is going away, so make JBD2 wait
for the buffers it needs on its own before submitting the cache flush.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
This patch is necessary before enabling flush/fua support in jbd2.
The flush-fua git tree has been updated to include this between patches
24 and 25.

Thanks.

fs/jbd2/commit.c | 29 ++++++++++++++++-------------
1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 7c068c1..8797fd1 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -701,6 +701,16 @@ start_journal_io:
}
}

+ err = journal_finish_inode_data_buffers(journal, commit_transaction);
+ if (err) {
+ printk(KERN_WARNING
+ "JBD2: Detected IO errors while flushing file data "
+ "on %s\n", journal->j_devname);
+ if (journal->j_flags & JBD2_ABORT_ON_SYNCDATA_ERR)
+ jbd2_journal_abort(journal, err);
+ err = 0;
+ }
+
/*
* If the journal is not located on the file system device,
* then we must flush the file system device before we issue
@@ -719,19 +729,6 @@ start_journal_io:
&cbh, crc32_sum);
if (err)
__jbd2_journal_abort_hard(journal);
- if (journal->j_flags & JBD2_BARRIER)
- blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL,
- BLKDEV_IFL_WAIT);
- }
-
- err = journal_finish_inode_data_buffers(journal, commit_transaction);
- if (err) {
- printk(KERN_WARNING
- "JBD2: Detected IO errors while flushing file data "
- "on %s\n", journal->j_devname);
- if (journal->j_flags & JBD2_ABORT_ON_SYNCDATA_ERR)
- jbd2_journal_abort(journal, err);
- err = 0;
}

/* Lo and behold: we have just managed to send a transaction to
@@ -845,6 +842,12 @@ wait_for_iobuf:
}
if (!err && !is_journal_aborted(journal))
err = journal_wait_on_commit_record(journal, cbh);
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT) &&
+ journal->j_flags & JBD2_BARRIER) {
+ blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL,
+ BLKDEV_IFL_WAIT);
+ }

if (err)
jbd2_journal_abort(journal, err);
--
1.7.1

2010-08-26 08:31:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On 08/25/2010 10:02 PM, Jan Kara wrote:
> On Wed 25-08-10 17:57:41, Tejun Heo wrote:
>> On 08/25/2010 06:00 PM, Christoph Hellwig wrote:
>>> On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
>>>> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
>>>>> From: Christoph Hellwig <[email protected]>
>>>>>
>>>>> ext4 already uses synchronous discards, no need to add I/O barriers.
>>>>
>>>> This needs the patch that Jan sent in reply to my initial version merged
>>>> into it.
>>>
>>> Actually the jbd2 patch needs it merged, but the point still stands.
>>
>> Yeah, wasn't sure about that one. Has anyone tested it? I'll be
>> happy to merge it but I have no idea whether it's correct or not and
>> Jan didn't seem to have tested it so... Jan, shall I merge the patch?
> I'm quite confident the patch is correct so you can merge it I think but
> tomorrow I'll give it some crash testing together with the rest of your
> patch set in KVM to be sure.

Patch included in the series before jbd2 conversion patch.

Thanks.

--
tejun

2010-08-26 09:35:13

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [PATCH 24.5/30] jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier

Hello.

Tejun Heo wrote:

> From 49f4cef00a1bd3c79fb2fe1f982c5157f0792867 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>

> Currently JBD2 relies blkdev_issue_flush() draining the queue when ASYNC_COMMIT
> feature is set. This property is going away so make JBD2 wait for buffers it
> needs on its own before submitting the cache flush.

> Signed-off-by: Jan Kara <[email protected]>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> This patch is necessary before enabling flush/fua support in jbd2.
> The flush-fua git tree has been udpated to included this between patch
> 24 and 25.

> Thanks.

> fs/jbd2/commit.c | 29 ++++++++++++++++-------------
> 1 files changed, 16 insertions(+), 13 deletions(-)

> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 7c068c1..8797fd1 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
[...]
> @@ -845,6 +842,12 @@ wait_for_iobuf:
> }
> if (!err && !is_journal_aborted(journal))
> err = journal_wait_on_commit_record(journal, cbh);
> + if (JBD2_HAS_INCOMPAT_FEATURE(journal,
> + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT) &&
> + journal->j_flags & JBD2_BARRIER) {
> + blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL,

Overindented line.

> + BLKDEV_IFL_WAIT);
> + }
>
> if (err)
> jbd2_journal_abort(journal, err);

WBR, Sergei

2010-08-26 09:42:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH UPDATED 24.5/30] jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier

Currently JBD2 relies on blkdev_issue_flush() draining the queue when the
ASYNC_COMMIT feature is set. This property is going away, so make JBD2 wait
for the buffers it needs on its own before submitting the cache flush.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
Fixed overindentation noticed by Sergei. git tree updated accordingly.

Thanks.

fs/jbd2/commit.c | 29 ++++++++++++++++-------------
1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 7c068c1..d6aeb1f 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -701,6 +701,16 @@ start_journal_io:
}
}

+ err = journal_finish_inode_data_buffers(journal, commit_transaction);
+ if (err) {
+ printk(KERN_WARNING
+ "JBD2: Detected IO errors while flushing file data "
+ "on %s\n", journal->j_devname);
+ if (journal->j_flags & JBD2_ABORT_ON_SYNCDATA_ERR)
+ jbd2_journal_abort(journal, err);
+ err = 0;
+ }
+
/*
* If the journal is not located on the file system device,
* then we must flush the file system device before we issue
@@ -719,19 +729,6 @@ start_journal_io:
&cbh, crc32_sum);
if (err)
__jbd2_journal_abort_hard(journal);
- if (journal->j_flags & JBD2_BARRIER)
- blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL,
- BLKDEV_IFL_WAIT);
- }
-
- err = journal_finish_inode_data_buffers(journal, commit_transaction);
- if (err) {
- printk(KERN_WARNING
- "JBD2: Detected IO errors while flushing file data "
- "on %s\n", journal->j_devname);
- if (journal->j_flags & JBD2_ABORT_ON_SYNCDATA_ERR)
- jbd2_journal_abort(journal, err);
- err = 0;
}

/* Lo and behold: we have just managed to send a transaction to
@@ -845,6 +842,12 @@ wait_for_iobuf:
}
if (!err && !is_journal_aborted(journal))
err = journal_wait_on_commit_record(journal, cbh);
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT) &&
+ journal->j_flags & JBD2_BARRIER) {
+ blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL,
+ BLKDEV_IFL_WAIT);
+ }

if (err)
jbd2_journal_abort(journal, err);
--
1.7.1

2010-08-26 09:55:03

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH] block: update documentation for REQ_FLUSH / REQ_FUA


Signed-off-by: Christoph Hellwig <[email protected]>

Index: linux-2.6/Documentation/block/barrier.txt
===================================================================
--- linux-2.6.orig/Documentation/block/barrier.txt 2010-08-26 06:46:20.993253858 -0300
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,261 +0,0 @@
-I/O Barriers
-============
-Tejun Heo <[email protected]>, July 22 2005
-
-I/O barrier requests are used to guarantee ordering around the barrier
-requests. Unless you're crazy enough to use disk drives for
-implementing synchronization constructs (wow, sounds interesting...),
-the ordering is meaningful only for write requests for things like
-journal checkpoints. All requests queued before a barrier request
-must be finished (made it to the physical medium) before the barrier
-request is started, and all requests queued after the barrier request
-must be started only after the barrier request is finished (again,
-made it to the physical medium).
-
-In other words, I/O barrier requests have the following two properties.
-
-1. Request ordering
-
-Requests cannot pass the barrier request. Preceding requests are
-processed before the barrier and following requests after.
-
-Depending on what features a drive supports, this can be done in one
-of the following three ways.
-
-i. For devices which have queue depth greater than 1 (TCQ devices) and
-support ordered tags, block layer can just issue the barrier as an
-ordered request and the lower level driver, controller and drive
-itself are responsible for making sure that the ordering constraint is
-met. Most modern SCSI controllers/drives should support this.
-
-NOTE: SCSI ordered tag isn't currently used due to limitation in the
- SCSI midlayer, see the following random notes section.
-
-ii. For devices which have queue depth greater than 1 but don't
-support ordered tags, block layer ensures that the requests preceding
-a barrier request finishes before issuing the barrier request. Also,
-it defers requests following the barrier until the barrier request is
-finished. Older SCSI controllers/drives and SATA drives fall in this
-category.
-
-iii. Devices which have queue depth of 1. This is a degenerate case
-of ii. Just keeping issue order suffices. Ancient SCSI
-controllers/drives and IDE drives are in this category.
-
-2. Forced flushing to physical medium
-
-Again, if you're not gonna do synchronization with disk drives (dang,
-it sounds even more appealing now!), the reason you use I/O barriers
-is mainly to protect filesystem integrity when power failure or some
-other events abruptly stop the drive from operating and possibly make
-the drive lose data in its cache. So, I/O barriers need to guarantee
-that requests actually get written to non-volatile medium in order.
-
-There are four cases,
-
-i. No write-back cache. Keeping requests ordered is enough.
-
-ii. Write-back cache but no flush operation. There's no way to
-guarantee physical-medium commit order. This kind of devices can't to
-I/O barriers.
-
-iii. Write-back cache and flush operation but no FUA (forced unit
-access). We need two cache flushes - before and after the barrier
-request.
-
-iv. Write-back cache, flush operation and FUA. We still need one
-flush to make sure requests preceding a barrier are written to medium,
-but post-barrier flush can be avoided by using FUA write on the
-barrier itself.
-
-
-How to support barrier requests in drivers
-------------------------------------------
-
-All barrier handling is done inside block layer proper. All low level
-drivers have to are implementing its prepare_flush_fn and using one
-the following two functions to indicate what barrier type it supports
-and how to prepare flush requests. Note that the term 'ordered' is
-used to indicate the whole sequence of performing barrier requests
-including draining and flushing.
-
-typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);
-
-int blk_queue_ordered(struct request_queue *q, unsigned ordered,
- prepare_flush_fn *prepare_flush_fn);
-
-@q : the queue in question
-@ordered : the ordered mode the driver/device supports
-@prepare_flush_fn : this function should prepare @rq such that it
- flushes cache to physical medium when executed
-
-For example, SCSI disk driver's prepare_flush_fn looks like the
-following.
-
-static void sd_prepare_flush(struct request_queue *q, struct request *rq)
-{
- memset(rq->cmd, 0, sizeof(rq->cmd));
- rq->cmd_type = REQ_TYPE_BLOCK_PC;
- rq->timeout = SD_TIMEOUT;
- rq->cmd[0] = SYNCHRONIZE_CACHE;
- rq->cmd_len = 10;
-}
-
-The following seven ordered modes are supported. The following table
-shows which mode should be used depending on what features a
-device/driver supports. In the leftmost column of table,
-QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
-
-The table is followed by description of each mode. Note that in the
-descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
-used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
-preceding step must be complete before proceeding to the next step.
-'->' indicates that the next step can start as soon as the previous
-step is issued.
-
- write-back cache ordered tag flush FUA
------------------------------------------------------------------------
-NONE yes/no N/A no N/A
-DRAIN no no N/A N/A
-DRAIN_FLUSH yes no yes no
-DRAIN_FUA yes no yes yes
-TAG no yes N/A N/A
-TAG_FLUSH yes yes yes no
-TAG_FUA yes yes yes yes
-
-
-QUEUE_ORDERED_NONE
- I/O barriers are not needed and/or supported.
-
- Sequence: N/A
-
-QUEUE_ORDERED_DRAIN
- Requests are ordered by draining the request queue and cache
- flushing isn't needed.
-
- Sequence: drain => barrier
-
-QUEUE_ORDERED_DRAIN_FLUSH
- Requests are ordered by draining the request queue and both
- pre-barrier and post-barrier cache flushings are needed.
-
- Sequence: drain => preflush => barrier => postflush
-
-QUEUE_ORDERED_DRAIN_FUA
- Requests are ordered by draining the request queue and
- pre-barrier cache flushing is needed. By using FUA on barrier
- request, post-barrier flushing can be skipped.
-
- Sequence: drain => preflush => barrier
-
-QUEUE_ORDERED_TAG
- Requests are ordered by ordered tag and cache flushing isn't
- needed.
-
- Sequence: barrier
-
-QUEUE_ORDERED_TAG_FLUSH
- Requests are ordered by ordered tag and both pre-barrier and
- post-barrier cache flushings are needed.
-
- Sequence: preflush -> barrier -> postflush
-
-QUEUE_ORDERED_TAG_FUA
- Requests are ordered by ordered tag and pre-barrier cache
- flushing is needed. By using FUA on barrier request,
- post-barrier flushing can be skipped.
-
- Sequence: preflush -> barrier
-
-
-Random notes/caveats
---------------------
-
-* SCSI layer currently can't use TAG ordering even if the drive,
-controller and driver support it. The problem is that SCSI midlayer
-request dispatch function is not atomic. It releases queue lock and
-switch to SCSI host lock during issue and it's possible and likely to
-happen in time that requests change their relative positions. Once
-this problem is solved, TAG ordering can be enabled.
-
-* Currently, no matter which ordered mode is used, there can be only
-one barrier request in progress. All I/O barriers are held off by
-block layer until the previous I/O barrier is complete. This doesn't
-make any difference for DRAIN ordered devices, but, for TAG ordered
-devices with very high command latency, passing multiple I/O barriers
-to low level *might* be helpful if they are very frequent. Well, this
-certainly is a non-issue. I'm writing this just to make clear that no
-two I/O barrier is ever passed to low-level driver.
-
-* Completion order. Requests in ordered sequence are issued in order
-but not required to finish in order. Barrier implementation can
-handle out-of-order completion of ordered sequence. IOW, the requests
-MUST be processed in order but the hardware/software completion paths
-are allowed to reorder completion notifications - eg. current SCSI
-midlayer doesn't preserve completion order during error handling.
-
-* Requeueing order. Low-level drivers are free to requeue any request
-after they removed it from the request queue with
-blkdev_dequeue_request(). As barrier sequence should be kept in order
-when requeued, generic elevator code takes care of putting requests in
-order around barrier. See blk_ordered_req_seq() and
-ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
-
-Note that block drivers must not requeue preceding requests while
-completing latter requests in an ordered sequence. Currently, no
-error checking is done against this.
-
-* Error handling. Currently, block layer will report error to upper
-layer if any of requests in an ordered sequence fails. Unfortunately,
-this doesn't seem to be enough. Look at the following request flow.
-QUEUE_ORDERED_TAG_FLUSH is in use.
-
- [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
- still in elevator
-
-Let's say request [2], [3] are write requests to update file system
-metadata (journal or whatever) and [barrier] is used to mark that
-those updates are valid. Consider the following sequence.
-
- i. Requests [0] ~ [post] leaves the request queue and enters
- low-level driver.
- ii. After a while, unfortunately, something goes wrong and the
- drive fails [2]. Note that any of [0], [1] and [3] could have
- completed by this time, but [pre] couldn't have been finished
- as the drive must process it in order and it failed before
- processing that command.
- iii. Error handling kicks in and determines that the error is
- unrecoverable and fails [2], and resumes operation.
- iv. [pre] [barrier] [post] gets processed.
- v. *BOOM* power fails
-
-The problem here is that the barrier request is *supposed* to indicate
-that filesystem update requests [2] and [3] made it safely to the
-physical medium and, if the machine crashes after the barrier is
-written, filesystem recovery code can depend on that. Sadly, that
-isn't true in this case anymore. IOW, the success of a I/O barrier
-should also be dependent on success of some of the preceding requests,
-where only upper layer (filesystem) knows what 'some' is.
-
-This can be solved by implementing a way to tell the block layer which
-requests affect the success of the following barrier request and
-making lower lever drivers to resume operation on error only after
-block layer tells it to do so.
-
-As the probability of this happening is very low and the drive should
-be faulty, implementing the fix is probably an overkill. But, still,
-it's there.
-
-* In previous drafts of barrier implementation, there was fallback
-mechanism such that, if FUA or ordered TAG fails, less fancy ordered
-mode can be selected and the failed barrier request is retried
-automatically. The rationale for this feature was that as FUA is
-pretty new in ATA world and ordered tag was never used widely, there
-could be devices which report to support those features but choke when
-actually given such requests.
-
- This was removed for two reasons 1. it's an overkill 2. it's
-impossible to implement properly when TAG ordering is used as low
-level drivers resume after an error automatically. If it's ever
-needed adding it back and modifying low level drivers accordingly
-shouldn't be difficult.
Index: linux-2.6/Documentation/block/writeback_cache_control.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/block/writeback_cache_control.txt 2010-08-26 06:46:14.555017932 -0300
@@ -0,0 +1,86 @@
+
+Explicit volatile write back cache control
+==========================================
+
+Introduction
+------------
+
+Many storage devices, especially in the consumer market, come with volatile
+write back caches. That means the devices signal I/O completion to the
+operating system before data actually has hit the non-volatile storage. This
+behavior obviously speeds up various workloads, but it means the operating
+system needs to force data out to the non-volatile storage when it performs
+a data integrity operation like fsync, sync or an unmount.
+
+The Linux block layer provides two simple mechanisms that let filesystems
+control the caching behavior of the storage device. These mechanisms are
+a forced cache flush, and the Force Unit Access (FUA) flag for requests.
+
+
+Explicit cache flushes
+----------------------
+
+The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from
+the filesystem and will make sure the volatile cache of the storage device
+has been flushed before the actual I/O operation is started. This explicitly
+guarantees that previously completed write requests are on non-volatile
+storage before the flagged bio starts. In addition, the REQ_FLUSH flag can
+be set on an otherwise empty bio structure, which causes only an explicit
+cache flush without any dependent I/O. It is recommended to use the
+blkdev_issue_flush() helper for a pure cache flush.
+
+
+Forced Unit Access
+------------------
+
+The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
+filesystem and will make sure that I/O completion for this request is only
+signaled after the data has been committed to non-volatile storage.
+
+
+Implementation details for filesystems
+--------------------------------------
+
+Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have
+to worry about whether the underlying devices need any explicit cache
+flushing or how Forced Unit Access is implemented. The REQ_FLUSH and
+REQ_FUA flags may both be set on a single bio.
+
+
+Implementation details for make_request_fn based block drivers
+--------------------------------------------------------------
+
+These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
+directly below the submit_bio interface. For remapping drivers the REQ_FUA
+bits need to be propagated to underlying devices, and a global flush needs
+to be implemented for bios with the REQ_FLUSH bit set. For real device
+drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
+on non-empty bios can simply be ignored, and REQ_FLUSH requests without
+data can be completed successfully without doing any work. Drivers for
+devices with volatile caches need to implement the support for these
+flags themselves without any help from the block layer.
+
+
+Implementation details for request_fn based block drivers
+---------------------------------------------------------
+
+For devices that do not support volatile write caches there is no driver
+support required, the block layer completes empty REQ_FLUSH requests before
+entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
+requests that have a payload. For devices with volatile write caches the
+driver needs to tell the block layer that it supports flushing caches by
+doing:
+
+ blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
+
+and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
+REQ_FLUSH requests with a payload are automatically turned into a sequence
+of an empty REQ_FLUSH request followed by the actual write by the block
+layer. For devices that also support the FUA bit the block layer needs
+to be told to pass through the REQ_FUA bit using:
+
+ blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
+
+and the driver must handle write requests that have the REQ_FUA bit set
+in prep_fn/request_fn. If the FUA bit is not natively supported the block
+layer turns it into an empty REQ_FLUSH request after the actual write.
Index: linux-2.6/Documentation/block/00-INDEX
===================================================================
--- linux-2.6.orig/Documentation/block/00-INDEX 2010-08-26 06:46:33.723023240 -0300
+++ linux-2.6/Documentation/block/00-INDEX 2010-08-26 06:46:54.932004457 -0300
@@ -1,7 +1,5 @@
00-INDEX
- This file
-barrier.txt
- - I/O Barriers
biodoc.txt
- Notes on the Generic Block Layer Rewrite in Linux 2.5
capability.txt
@@ -16,3 +14,5 @@ stat.txt
- Block layer statistics in /sys/block/<dev>/stat
switching-sched.txt
- Switching I/O schedulers at runtime
+writeback_cache_control.txt
+ - Control of volatile write back caches
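
To make the filesystem-facing side of the new document concrete, here is a
minimal sketch of the two mechanisms it describes, against the 2.6.36-era
bio API. The flag names and the blkdev_issue_flush() arguments are as used
elsewhere in this series; the wrapper function, its parameters, and the
completion handler are made-up names for the example.

    /* sketch: a pure cache flush with no dependent I/O, as the
     * document recommends for that case */
    blkdev_issue_flush(bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);

    /* sketch: a commit-record style write that wants all previously
     * completed writes stable first (REQ_FLUSH) and itself stable
     * before completion is signaled (REQ_FUA) */
    static void submit_commit_record(struct block_device *bdev,
                                     struct page *page, sector_t sector,
                                     bio_end_io_t *end_io)
    {
            struct bio *bio = bio_alloc(GFP_NOFS, 1);

            bio->bi_bdev = bdev;
            bio->bi_sector = sector;
            bio_add_page(bio, page, PAGE_SIZE, 0);
            bio->bi_end_io = end_io;
            submit_bio(WRITE_SYNC | REQ_FLUSH | REQ_FUA, bio);
    }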

2010-08-27 09:24:41

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] block: update documentation for REQ_FLUSH / REQ_FUA

On 08/26/2010 11:54 AM, Christoph Hellwig wrote:
>
> Signed-off-by: Christoph Hellwig <[email protected]>

applied to misc#flush-fua.

Thanks.

--
tejun

2010-08-27 17:32:45

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On Thu 26-08-10 10:25:47, Tejun Heo wrote:
> On 08/25/2010 10:02 PM, Jan Kara wrote:
> > On Wed 25-08-10 17:57:41, Tejun Heo wrote:
> >> On 08/25/2010 06:00 PM, Christoph Hellwig wrote:
> >>> On Wed, Aug 25, 2010 at 05:58:42PM +0200, Christoph Hellwig wrote:
> >>>> On Wed, Aug 25, 2010 at 05:47:43PM +0200, Tejun Heo wrote:
> >>>>> From: Christoph Hellwig <[email protected]>
> >>>>>
> >>>>> ext4 already uses synchronous discards, no need to add I/O barriers.
> >>>>
> >>>> This needs the patch that Jan sent in reply to my initial version merged
> >>>> into it.
> >>>
> >>> Actually the jbd2 patch needs it merged, but the point still stands.
> >>
> >> Yeah, wasn't sure about that one. Has anyone tested it? I'll be
> >> happy to merge it but I have no idea whether it's correct or not and
> >> Jan didn't seem to have tested it so... Jan, shall I merge the patch?
> > I'm quite confident the patch is correct so you can merge it I think but
> > tomorrow I'll give it some crash testing together with the rest of your
> > patch set in KVM to be sure.
>
> Patch included in the series before jbd2 conversion patch.
An update: I've set up ext4 barrier testing in KVM - run fsstress,
kill KVM at some random moment, and check that the filesystem is
consistent (KVM is run in cache=writeback mode to simulate a disk cache).
About 70 runs without journal_async_commit passed fine; now I'm running
some tests with the option enabled and the first few rounds have passed
OK as well.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-08-30 15:38:05

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 04/30] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()

On 08/25/2010 06:47 PM, Tejun Heo wrote:
> Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
> requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
> -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
> blk_queue_flush().
>
> blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a
> device has write cache and can flush it, it should set REQ_FLUSH. If
> the device can handle FUA writes, it should also set REQ_FUA.
>
> All blk_queue_ordered() users are converted.
>
> * ORDERED_DRAIN is mapped to 0 which is the default value.
> * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
> * ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Michael S. Tsirkin <[email protected]>
> Cc: Jeremy Fitzhardinge <[email protected]>
> Cc: Chris Wright <[email protected]>
> Cc: FUJITA Tomonori <[email protected]>
> Cc: Boaz Harrosh <[email protected]>

Acked-by: Boaz Harrosh <[email protected]>

Actually osd has support for FUA as well. It's on my todo list
to implement it.

Thanks Tejun; as usual, after your visit there is more room
in the house.

> Cc: Geert Uytterhoeven <[email protected]>
> Cc: David S. Miller <[email protected]>
> Cc: Alasdair G Kergon <[email protected]>
> Cc: Pierre Ossman <[email protected]>
> Cc: Stefan Weinhuber <[email protected]>
> ---
<snip>
> drivers/block/osdblk.c | 2 +-
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index 2284b4f..72d6246 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdblk_device *osdev)
> blk_queue_stack_limits(q, osd_request_queue(osdev->osd));
>
> blk_queue_prep_rq(q, blk_queue_start_tag);
> - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
> + blk_queue_flush(q, REQ_FLUSH);
>
> disk->queue = q;
>
<snip>

Also, for this mail:
On 08/25/2010 06:47 PM, Tejun Heo wrote:
> REQ_HARDBARRIER is deprecated. Remove spurious uses in the following
> users. Please note that other than osdblk, all other uses were
> already spurious before deprecation.
>
> * osdblk: osdblk_rq_fn() won't receive any request with
> REQ_HARDBARRIER set. Remove the test for it.
>
> * pktcdvd: use of REQ_HARDBARRIER in pkt_generic_packet() doesn't mean
> anything. Removed.
>
> * aic7xxx_old: Setting MSG_ORDERED_Q_TAG on REQ_HARDBARRIER is
> spurious. Removed.
>
> * sas_scsi_host: Setting TASK_ATTR_ORDERED on REQ_HARDBARRIER is
> spurious. Removed.
>
> * scsi_tcq: The ordered tag path wasn't being used anyway. Removed.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Boaz Harrosh <[email protected]>

Acked-by: Boaz Harrosh <[email protected]>

> Cc: James Bottomley <[email protected]>
> Cc: Peter Osterlund <[email protected]>
> ---
> drivers/block/osdblk.c | 3 +--
<snip>
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index 72d6246..87311eb 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -310,8 +310,7 @@ static void osdblk_rq_fn(struct request_queue *q)
> break;
>
> /* filter out block requests we don't understand */
> - if (rq->cmd_type != REQ_TYPE_FS &&
> - !(rq->cmd_flags & REQ_HARDBARRIER)) {
> + if (rq->cmd_type != REQ_TYPE_FS) {
> blk_end_request_all(rq, 0);
> continue;
> }

2010-08-30 19:59:40

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

Jan Kara <[email protected]> writes:

> An update: I've set up an ext4 barrier testing in KVM - run fsstress,
> kill KVM at some random moment and check that the filesystem is consistent
> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs

But doesn't your "disk cache" survive the "power cycle" of your guest?
It's tough to tell exactly what you're testing with so few details;
care to elaborate?

Cheers,
Jeff

2010-08-30 20:21:31

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > An update: I've set up an ext4 barrier testing in KVM - run fsstress,
> > kill KVM at some random moment and check that the filesystem is consistent
> > (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>
> But doesn't your "disk cache" survive the "power cycle" of your guest?
Yes, you're right. Thinking about it now, the test setup was wrong because
it didn't refuse writes to the VM's data partition after the moment I
killed KVM. Thanks for catching this. I will probably have to use fault
injection on the host to disallow writing to the device at a certain
moment. Or does somebody have a better option?
My setup is that I have a dedicated partition / drive for a filesystem
which is written to from a guest kernel running under KVM. I have set it
up using the virtio driver with cache=writeback so that the host caches
the writes in a similar way to how a disk caches them. At some point I
just kill the qemu-kvm process, and at that point I'd like to also throw
away the data cached by the host...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-08-30 20:25:38

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On 08/30/2010 05:20 PM, Jan Kara wrote:
> On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
>> Jan Kara<[email protected]> writes:
>>
>>> An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>>> kill KVM at some random moment and check that the filesystem is consistent
>>> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>> But doesn't your "disk cache" survive the "power cycle" of your guest?
> Yes, you're right. Thinking about it now the test setup was wrong because
> it didn't refuse writes to the VM's data partition after the moment I
> killed KVM. Thanks for catching this. I will probably have to use the fault
> injection on the host to disallow writing the device at a certain moment.
> Or does somebody have a better option?
> My setup is that I have a dedicated partition / drive for a filesystem
> which is written to from a guest kernel running under KVM. I have set it up
> using virtio driver with cache=writeback so that the host caches the writes
> in a similar way disk caches them. At some point I just kill the qemu-kvm
> process and at that point I'd like to also throw away data cached by the
> host...
>
> Honza

Hi Jan,

Not sure if this is relevant, but what we have been using for part of the
testing is an external e-SATA enclosure that you can stick pretty much any
S-ATA disk into. It is important to drop power to the external disk itself
(do not just pull the S-ATA cable; the firmware will destage the write
cache on some/many disks if it still has power and sees link loss :)).

Once you turn the drive back on, the test is: can you mount without error,
unmount, and run fsck -f to verify there is no metadata corruption.

Ric

2010-08-30 20:40:56

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

Jan Kara, on 08/31/2010 12:20 AM wrote:
> On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
>> Jan Kara<[email protected]> writes:
>>
>>> An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>>> kill KVM at some random moment and check that the filesystem is consistent
>>> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>>
>> But doesn't your "disk cache" survive the "power cycle" of your guest?
> Yes, you're right. Thinking about it now the test setup was wrong because
> it didn't refuse writes to the VM's data partition after the moment I
> killed KVM. Thanks for catching this. I will probably have to use the fault
> injection on the host to disallow writing the device at a certain moment.
> Or does somebody have a better option?

Have you considered setting up a second box as an iSCSI target (e.g. with
iSCSI-SCST)? With it, killing the connectivity is just a matter of a
single iptables command, plus you get a lot more options.

Vlad

2010-08-30 21:03:00

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On Tue 31-08-10 00:39:41, Vladislav Bolkhovitin wrote:
> Jan Kara, on 08/31/2010 12:20 AM wrote:
> >On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
> >>Jan Kara<[email protected]> writes:
> >>
> >>> An update: I've set up an ext4 barrier testing in KVM - run fsstress,
> >>>kill KVM at some random moment and check that the filesystem is consistent
> >>>(kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
> >>
> >>But doesn't your "disk cache" survive the "power cycle" of your guest?
> > Yes, you're right. Thinking about it now the test setup was wrong because
> >it didn't refuse writes to the VM's data partition after the moment I
> >killed KVM. Thanks for catching this. I will probably have to use the fault
> >injection on the host to disallow writing the device at a certain moment.
> >Or does somebody have a better option?
>
> Have you considered to setup a second box as an iSCSI target (e.g.
> with iSCSI-SCST)? With it killing the connectivity is just a matter
> of a single iptables command + a lot more options.
Hmm, this might be an interesting option. I will try that. Thanks for the
suggestion.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-08-30 21:04:10

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

Jan Kara <[email protected]> writes:

> On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
>> Jan Kara <[email protected]> writes:
>>
>> > An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>> > kill KVM at some random moment and check that the filesystem is consistent
>> > (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>>
>> But doesn't your "disk cache" survive the "power cycle" of your guest?
> Yes, you're right. Thinking about it now the test setup was wrong because
> it didn't refuse writes to the VM's data partition after the moment I
> killed KVM. Thanks for catching this. I will probably have to use the fault
> injection on the host to disallow writing the device at a certain moment.
> Or does somebody have a better option?
> My setup is that I have a dedicated partition / drive for a filesystem
> which is written to from a guest kernel running under KVM. I have set it up
> using virtio driver with cache=writeback so that the host caches the writes
> in a similar way disk caches them. At some point I just kill the qemu-kvm
> process and at that point I'd like to also throw away data cached by the
> host...

I've used iLO to power off the system under test remotely. I have a
tool to automate the testing. It works as follows:

There's a client and a server. The server listens on an ip/port for
connections. A client will connect, tell the server its configuration
(including what disk it's writing to, what block size it's using, and
the total amount of I/O to be done), and then start doing I/O. The I/O
is done using the AIO API, and the data written includes a block number,
a generation number, fill, and a CRC. As each completion comes in, the
completed sectors are communicated to the server program. Upon
completion of an entire series of writes (writing the entire data set
once), the server waits some amount of time and then power cycles the
client. The client comes back up and is run in check mode to verify
that all of the data it reported as completed to the server is actually
intact.

I recently updated the code to run against a file on a file system
(previously it would only work on a block device). It makes use of
stonith modules to do the power cycling. It works, but it isn't the
most elegant bit of engineering I've ever done. ;-)

Anyway, that code is available here:
http://people.redhat.com/jmoyer/dainto-0.99.4.tar.bz2

Cheers,
Jeff
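
The per-sector record Jeff describes would look something like the sketch
below. This is a hypothetical layout, not the actual dainto format; the
field names, sizes, and CRC placement are assumptions for illustration.

    #include <stdint.h>

    /* hypothetical 512-byte record: block number, generation, fill,
     * and a CRC so check mode can verify each completed sector */
    struct io_record {
            uint64_t block;        /* logical block number */
            uint64_t generation;   /* which write pass produced this */
            uint8_t  fill[492];    /* pads the record to 512 bytes */
            uint32_t crc;          /* CRC32 over the fields above */
    };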

2010-08-31 08:12:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On 08/30/2010 10:20 PM, Jan Kara wrote:
> My setup is that I have a dedicated partition / drive for a filesystem
> which is written to from a guest kernel running under KVM. I have set it up
> using virtio driver with cache=writeback so that the host caches the writes
> in a similar way disk caches them. At some point I just kill the qemu-kvm
> process and at that point I'd like to also throw away data cached by the
> host...

$ echo 1 > /sys/block/sdX/device/delete
$ echo - - - > /sys/class/scsi_host/hostX/scan

should do the trick.

Thanks.

--
tejun

2010-08-31 09:55:34

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On 08/31/2010 12:02 AM, Jan Kara wrote:
> On Tue 31-08-10 00:39:41, Vladislav Bolkhovitin wrote:
>> Jan Kara, on 08/31/2010 12:20 AM wrote:
>>> On Mon 30-08-10 15:56:43, Jeff Moyer wrote:
>>>> Jan Kara<[email protected]> writes:
>>>>
>>>>> An update: I've set up an ext4 barrier testing in KVM - run fsstress,
>>>>> kill KVM at some random moment and check that the filesystem is consistent
>>>>> (kvm is run in cache=writeback mode to simulate disk cache). About 70 runs
>>>>
>>>> But doesn't your "disk cache" survive the "power cycle" of your guest?
>>> Yes, you're right. Thinking about it now the test setup was wrong because
>>> it didn't refuse writes to the VM's data partition after the moment I
>>> killed KVM. Thanks for catching this. I will probably have to use the fault
>>> injection on the host to disallow writing the device at a certain moment.
>>> Or does somebody have a better option?
>>
>> Have you considered to setup a second box as an iSCSI target (e.g.
>> with iSCSI-SCST)? With it killing the connectivity is just a matter
>> of a single iptables command + a lot more options.

Still the same problem, no? The data is still cached on the backing-store
device; how do you trash the cached data?

> Hmm, this might be an interesting option. Will try that. Thanks for
> suggestion.
>
> Honza

With stgt it's very simple as well. It's a user-mode application.
All on the same machine:
- run the stgt application
- log in + mount a filesystem
- run the test
- kill -9 stgt mid-flight

But how do you throw away the data in the backing-store cache?

Boaz

2010-08-31 10:08:05

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On 08/31/2010 11:11 AM, Tejun Heo wrote:
> On 08/30/2010 10:20 PM, Jan Kara wrote:
>> My setup is that I have a dedicated partition / drive for a filesystem
>> which is written to from a guest kernel running under KVM. I have set it up
>> using virtio driver with cache=writeback so that the host caches the writes
>> in a similar way disk caches them. At some point I just kill the qemu-kvm
>> process and at that point I'd like to also throw away data cached by the
>> host...
>
> $ echo 1 > /sys/block/sdX/device/delete
> $ echo - - - > /sys/class/scsi_host/hostX/scan
>

I don't know all the specifics of the virtio driver and the KVM backend,
but isn't the KVM target I/O eventually directed to a local file or device?
If so, the SCSI device has disappeared, but the bulk of the data is in the
host cache at the backing store (file or bdev). Once all files are closed,
the data is synced to disk.

Is it not the same as Ric's problem of disconnecting the SATA cable but
not dropping power to the drive? Most of the cache is still intact.

> should do the trick.
>
> Thanks.
>

Thanks
Boaz

2010-08-31 10:14:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

Hello,

On 08/31/2010 12:07 PM, Boaz Harrosh wrote:
> I don't know all the specifics of the virtio driver and the KVM backend but
> don't the KVM target io is eventually directed to a local file or device?
> If so the scsi device has disappeard but the bulk of the data is in host cache
> at the backstore (file or bdev). Once all files are closed the data is synced
> to disk.
>
> Is it not the same as Ric's problem of disconnecting the sata cable but
> not dropping power to the drive. The main of the cache is still intact.

There are two layers of caching there.

drive cache - host page cache - guest

When the guest issues FLUSH, qemu will translate it into fdatasync, which
will flush the host page cache, followed by a FLUSH to the drive, which
will flush the drive cache to the media. If you delete the host disk
device, it will be detached without the host page cache being flushed.
So, although it's not complete, it will lose a good part of the cache.
With the writeout timeout increased and/or laptop mode enabled, it will
probably lose most of the cache.

Thanks.

--
tejun
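
In host-side terms, honoring a guest FLUSH for a file-backed disk comes
down to something like the sketch below. This is an approximation of the
behavior Tejun describes, not qemu source; the function name and the file
descriptor are made up for the example.

    #include <unistd.h>

    /* sketch: servicing a guest cache flush for a raw file backend */
    static int handle_guest_flush(int backing_fd)
    {
            /* flushes the host page cache for the backing file; the
             * host filesystem in turn issues a FLUSH to the drive so
             * the data reaches stable media */
            return fdatasync(backing_fd);
    }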

2010-08-31 10:27:15

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 26/30] ext4: do not send discards as barriers

On 08/31/2010 01:13 PM, Tejun Heo wrote:
> Hello,
>
> On 08/31/2010 12:07 PM, Boaz Harrosh wrote:
>> I don't know all the specifics of the virtio driver and the KVM backend but
>> don't the KVM target io is eventually directed to a local file or device?
>> If so the scsi device has disappeard but the bulk of the data is in host cache
>> at the backstore (file or bdev). Once all files are closed the data is synced
>> to disk.
>>
>> Is it not the same as Ric's problem of disconnecting the sata cable but
>> not dropping power to the drive. The main of the cache is still intact.
>
> There are two layers of caching there.
>
> drive cache - host page cache - guest
>
> When guest issues FLUSH, qemu will translate it into fdatasync which
> will flush the host page cache followed by FLUSH to the drive which
> will flush the drive cache to the media. If you delete the host disk
> device, it will be detached w/o host page cache flushed. So, although
> it's not complete, it will lose good part of cache. With out write
> out timeout increased and/or with laptop mode enabled, it will
> probably lose most of cache.
>

Ha, OK, you meant that device. So if you have a dedicated physical device
for the backing store, that would be a very nice, scriptable approach.

Thanks, that's a much better automated test than pulling drives out of
sockets.

> Thanks.
>

Boaz