Hello,
This is the second take of block-dm-finish-REQ_FLUSH-FUA-conversion.
I've put the new patches on top of the previous ones for easier review.
The series can be trivially reordered so that the order is more
logical. Jens, please let me know if you want it to be reordered.
The dm conversion is _lightly_ tested. Please proceed with caution.
Let's hold off merging these bits until dm people can verify the
conversion is correct.
0001-block-make-__blk_rq_prep_clone-copy-most-command-fla.patch
0002-dm-implement-REQ_FLUSH-FUA-support-for-bio-based-dm.patch
0003-dm-relax-ordering-of-bio-based-flush-implementation.patch
0004-dm-implement-REQ_FLUSH-FUA-support-for-request-based.patch
0005-block-remove-the-WRITE_BARRIER-flag.patch
Differences from the previous attempt[1] are,
* bio-based dm and request-based dm patches are split. 0002-0003
convert bio-based dm and 0004 converts request-based dm.
* The previous request-based dm conversion was broken in the way
requests were sequenced. This patch rips out the special multi-target
handling for flushes and handles flushes the same way as other
requests.
This patchset is on top of the "block, fs: replace HARDBARRIER with
FLUSH/FUA" patchset[2], is available in the following git tree
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua
and contains the following changes.
block/blk-core.c | 4
drivers/md/dm-crypt.c | 2
drivers/md/dm-io.c | 20 --
drivers/md/dm-log.c | 2
drivers/md/dm-raid1.c | 8
drivers/md/dm-region-hash.c | 16 -
drivers/md/dm-snap-persistent.c | 2
drivers/md/dm-snap.c | 6
drivers/md/dm-stripe.c | 2
drivers/md/dm.c | 394 ++++++++--------------------------------
include/linux/blk_types.h | 1
include/linux/fs.h | 3
12 files changed, 104 insertions(+), 356 deletions(-)
Thanks.
--
tejun
[1] http://thread.gmane.org/gmane.linux.raid/29344
[2] http://thread.gmane.org/gmane.linux.kernel/1022363
Currently __blk_rq_prep_clone() copies only REQ_WRITE and REQ_DISCARD.
There's no reason to omit the other command flags, and REQ_FUA needs
to be copied to implement FUA support in request-based dm.
REQ_COMMON_MASK, which specifies the flags to be copied from bio to
request, already identifies all the command flags. Define
REQ_CLONE_MASK to be the same as REQ_COMMON_MASK for clarity and make
__blk_rq_prep_clone() copy all flags in the mask.
Signed-off-by: Tejun Heo <[email protected]>
---
block/blk-core.c | 4 +---
include/linux/blk_types.h | 1 +
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 495bdc4..2a5b192 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2505,9 +2505,7 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
static void __blk_rq_prep_clone(struct request *dst, struct request *src)
{
dst->cpu = src->cpu;
- dst->cmd_flags = (rq_data_dir(src) | REQ_NOMERGE);
- if (src->cmd_flags & REQ_DISCARD)
- dst->cmd_flags |= REQ_DISCARD;
+ dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
dst->cmd_type = src->cmd_type;
dst->__sector = blk_rq_pos(src);
dst->__data_len = blk_rq_bytes(src);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1797994..36edadf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -168,6 +168,7 @@ enum rq_flag_bits {
#define REQ_COMMON_MASK \
(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
+#define REQ_CLONE_MASK REQ_COMMON_MASK
#define REQ_UNPLUG (1 << __REQ_UNPLUG)
#define REQ_RAHEAD (1 << __REQ_RAHEAD)
--
1.7.1
From: Christoph Hellwig <[email protected]>
It's unused now.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/fs.h | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32703a9..6b0f6e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -135,7 +135,6 @@ struct inodes_stat_t {
* immediately after submission. The write equivalent
* of READ_SYNC.
* WRITE_ODIRECT_PLUG Special case write for O_DIRECT only.
- * WRITE_BARRIER DEPRECATED. Always fails. Use FLUSH/FUA instead.
* WRITE_FLUSH Like WRITE_SYNC but with preceding cache flush.
* WRITE_FUA Like WRITE_SYNC but data is guaranteed to be on
* non-volatile media on completion.
@@ -157,8 +156,6 @@ struct inodes_stat_t {
#define WRITE_SYNC (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)
#define WRITE_ODIRECT_PLUG (WRITE | REQ_SYNC)
#define WRITE_META (WRITE | REQ_META)
-#define WRITE_BARRIER (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
- REQ_HARDBARRIER)
#define WRITE_FLUSH (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
REQ_FLUSH)
#define WRITE_FUA (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
--
1.7.1
This patch converts request-based dm to support the new REQ_FLUSH/FUA.
The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.
In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special mostly independent path which can issue
flushes to multiple targets and sequence them. However, the
capability isn't currently in use and adds a lot of complexity.
Moreover, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.
This patch rips out the special flush code path and handles
REQ_FLUSH/FUA requests the same way as other requests. The only
special treatment is that REQ_FLUSH requests use block address 0
when finding the target, which is enough for now.
Signed-off-by: Tejun Heo <[email protected]>
---
drivers/md/dm.c | 204 ++++++-------------------------------------------------
1 files changed, 20 insertions(+), 184 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e67c519..81a012f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
spinlock_t deferred_lock;
/*
- * Protect barrier_error from concurrent endio processing
- * in request-based dm.
- */
- spinlock_t barrier_error_lock;
- int barrier_error;
-
- /*
- * Processing queue (flush/barriers)
+ * Processing queue (flush)
*/
struct workqueue_struct *wq;
- struct work_struct barrier_work;
-
- /* A pointer to the currently processing pre/post flush request */
- struct request *flush_request;
/*
* The current mapping.
@@ -732,23 +721,6 @@ static void end_clone_bio(struct bio *clone, int error)
blk_update_request(tio->orig, 0, nr_bytes);
}
-static void store_barrier_error(struct mapped_device *md, int error)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&md->barrier_error_lock, flags);
- /*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
- */
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
/*
* Don't touch any member of the md after calling this function because
* the md may be freed in dm_put() at the end of this function.
@@ -786,13 +758,11 @@ static void free_rq_clone(struct request *clone)
static void dm_end_request(struct request *clone, int error)
{
int rw = rq_data_dir(clone);
- int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;
- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;
@@ -806,15 +776,8 @@ static void dm_end_request(struct request *clone, int error)
}
free_rq_clone(clone);
-
- if (unlikely(is_barrier)) {
- if (unlikely(error))
- store_barrier_error(md, error);
- run_queue = 0;
- } else
- blk_end_request_all(rq, error);
-
- rq_completed(md, rw, run_queue);
+ blk_end_request_all(rq, error);
+ rq_completed(md, rw, true);
}
static void dm_unprep_request(struct request *rq)
@@ -839,16 +802,6 @@ void dm_requeue_unmapped_request(struct request *clone)
struct request_queue *q = rq->q;
unsigned long flags;
- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
- /*
- * Barrier clones share an original request.
- * Leave it to dm_end_request(), which handles this special
- * case.
- */
- dm_end_request(clone, DM_ENDIO_REQUEUE);
- return;
- }
-
dm_unprep_request(rq);
spin_lock_irqsave(q->queue_lock, flags);
@@ -938,19 +891,6 @@ static void dm_complete_request(struct request *clone, int error)
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;
- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
- /*
- * Barrier clones share an original request. So can't use
- * softirq_done with the original.
- * Pass the clone to dm_done() directly in this special case.
- * It is safe (even if clone->q->queue_lock is held here)
- * because there is no I/O dispatching during the completion
- * of barrier clone.
- */
- dm_done(clone, error, true);
- return;
- }
-
tio->error = error;
rq->completion_data = clone;
blk_complete_request(rq);
@@ -967,17 +907,6 @@ void dm_kill_unmapped_request(struct request *clone, int error)
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;
- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
- /*
- * Barrier clones share an original request.
- * Leave it to dm_end_request(), which handles this special
- * case.
- */
- BUG_ON(error > 0);
- dm_end_request(clone, error);
- return;
- }
-
rq->cmd_flags |= REQ_FAILED;
dm_complete_request(clone, error);
}
@@ -1507,14 +1436,6 @@ static int dm_request(struct request_queue *q, struct bio *bio)
return _dm_request(q, bio);
}
-static bool dm_rq_is_flush_request(struct request *rq)
-{
- if (rq->cmd_flags & REQ_FLUSH)
- return true;
- else
- return false;
-}
-
void dm_dispatch_request(struct request *rq)
{
int r;
@@ -1562,22 +1483,15 @@ static int setup_clone(struct request *clone, struct request *rq,
{
int r;
- if (dm_rq_is_flush_request(rq)) {
- blk_rq_init(NULL, clone);
- clone->cmd_type = REQ_TYPE_FS;
- clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
- } else {
- r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
- dm_rq_bio_constructor, tio);
- if (r)
- return r;
-
- clone->cmd = rq->cmd;
- clone->cmd_len = rq->cmd_len;
- clone->sense = rq->sense;
- clone->buffer = rq->buffer;
- }
+ r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
+ dm_rq_bio_constructor, tio);
+ if (r)
+ return r;
+ clone->cmd = rq->cmd;
+ clone->cmd_len = rq->cmd_len;
+ clone->sense = rq->sense;
+ clone->buffer = rq->buffer;
clone->end_io = end_clone_request;
clone->end_io_data = tio;
@@ -1618,9 +1532,6 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
struct mapped_device *md = q->queuedata;
struct request *clone;
- if (unlikely(dm_rq_is_flush_request(rq)))
- return BLKPREP_OK;
-
if (unlikely(rq->special)) {
DMWARN("Already has something in rq->special.");
return BLKPREP_KILL;
@@ -1697,6 +1608,7 @@ static void dm_request_fn(struct request_queue *q)
struct dm_table *map = dm_get_live_table(md);
struct dm_target *ti;
struct request *rq, *clone;
+ sector_t pos;
/*
* For suspend, check blk_queue_stopped() and increment
@@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
if (!rq)
goto plug_and_out;
- if (unlikely(dm_rq_is_flush_request(rq))) {
- BUG_ON(md->flush_request);
- md->flush_request = rq;
- blk_start_request(rq);
- queue_work(md->wq, &md->barrier_work);
- goto out;
- }
+ /* always use block 0 to find the target for flushes for now */
+ pos = 0;
+ if (!(rq->cmd_flags & REQ_FLUSH))
+ pos = blk_rq_pos(rq);
- ti = dm_table_find_target(map, blk_rq_pos(rq));
+ ti = dm_table_find_target(map, pos);
if (ti->type->busy && ti->type->busy(ti))
goto plug_and_out;
@@ -1888,7 +1797,6 @@ out:
static const struct block_device_operations dm_blk_dops;
static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
static void dm_init_md_queue(struct mapped_device *md)
{
@@ -1943,7 +1851,6 @@ static struct mapped_device *alloc_dev(int minor)
mutex_init(&md->suspend_lock);
mutex_init(&md->type_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -1966,7 +1873,6 @@ static struct mapped_device *alloc_dev(int minor)
atomic_set(&md->pending[1], 0);
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
- INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
init_waitqueue_head(&md->eventq);
md->disk->major = _major;
@@ -2220,8 +2126,6 @@ static int dm_init_request_based_queue(struct mapped_device *md)
blk_queue_softirq_done(md->queue, dm_softirq_done);
blk_queue_prep_rq(md->queue, dm_prep_fn);
blk_queue_lld_busy(md->queue, dm_lld_busy);
- /* no flush support for request based dm yet */
- blk_queue_flush(md->queue, 0);
elv_register_queue(md->queue);
@@ -2421,73 +2325,6 @@ static void dm_queue_flush(struct mapped_device *md)
queue_work(md->wq, &md->work);
}
-static void dm_rq_set_target_request_nr(struct request *clone, unsigned request_nr)
-{
- struct dm_rq_target_io *tio = clone->end_io_data;
-
- tio->info.target_request_nr = request_nr;
-}
-
-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
-{
- int i, j;
- struct dm_table *map = dm_get_live_table(md);
- unsigned num_targets = dm_table_get_num_targets(map);
- struct dm_target *ti;
- struct request *clone;
-
- md->barrier_error = 0;
-
- for (i = 0; i < num_targets; i++) {
- ti = dm_table_get_target(map, i);
- for (j = 0; j < ti->num_flush_requests; j++) {
- clone = clone_rq(md->flush_request, md, GFP_NOIO);
- dm_rq_set_target_request_nr(clone, j);
- atomic_inc(&md->pending[rq_data_dir(clone)]);
- map_request(ti, clone, md);
- }
- }
-
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
- dm_table_put(map);
-
- return md->barrier_error;
-}
-
-static void dm_rq_barrier_work(struct work_struct *work)
-{
- int error;
- struct mapped_device *md = container_of(work, struct mapped_device,
- barrier_work);
- struct request_queue *q = md->queue;
- struct request *rq;
- unsigned long flags;
-
- /*
- * Hold the md reference here and leave it at the last part so that
- * the md can't be deleted by device opener when the barrier request
- * completes.
- */
- dm_get(md);
-
- error = dm_rq_barrier(md);
-
- rq = md->flush_request;
- md->flush_request = NULL;
-
- if (error == DM_ENDIO_REQUEUE) {
- spin_lock_irqsave(q->queue_lock, flags);
- blk_requeue_request(q, rq);
- spin_unlock_irqrestore(q->queue_lock, flags);
- } else
- blk_end_request_all(rq, error);
-
- blk_run_queue(q);
-
- dm_put(md);
-}
-
/*
* Swap in a new table, returning the old one for the caller to destroy.
*/
@@ -2619,9 +2456,8 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
up_write(&md->io_lock);
/*
- * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
- * can be kicked until md->queue is stopped. So stop md->queue before
- * flushing md->wq.
+ * Stop md->queue before flushing md->wq in case request-based
+ * dm defers requests to md->wq from md->queue.
*/
if (dm_request_based(md))
stop_queue(md->queue);
--
1.7.1
Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
against other bio's. This patch relaxes ordering around flushes.
* A flush bio is no longer deferred to the workqueue directly. It's
processed like other bio's but __split_and_process_bio() uses
md->flush_bio as the clone source. md->flush_bio is initialized to an
empty flush during md initialization and shared for all flushes.
* When dec_pending() detects that a flush has completed, it checks
whether the original bio has data. If so, the bio is queued to the
deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
* As flush sequencing is handled in the usual issue/completion path,
dm_wq_work() no longer needs to handle flushes differently. Now its
only responsibility is re-issuing deferred bio's the same way as
_dm_request() would. REQ_FLUSH handling logic including
process_flush() is dropped.
* There's no reason for queue_io() and dm_wq_work() to take
md->io_lock for writing. queue_io() now only uses md->deferred_lock
and dm_wq_work() takes md->io_lock for reading.
* bio's no longer need to be queued on the deferred list while a flush
is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary. Drop it.
This avoids stalling the device during flushes and simplifies the
implementation.
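In sketch form, the completion side of the new flow looks roughly like
the following (condensed from the dec_pending() hunk below; this is an
illustration only, not the literal code):

	/* dec_pending(): a flush that carries data gets its data
	 * portion reissued once the preflush completes */
	if (!(bio->bi_rw & REQ_FLUSH) || !bio->bi_size) {
		/* plain bio or empty flush: complete it */
		bio_endio(bio, io_error);
	} else {
		/* preflush done for a flush with data: reissue
		 * the bio without REQ_FLUSH */
		bio->bi_rw &= ~REQ_FLUSH;
		queue_io(md, bio);
	}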
Signed-off-by: Tejun Heo <[email protected]>
---
drivers/md/dm.c | 157 ++++++++++++++++---------------------------------------
1 files changed, 45 insertions(+), 112 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 32e6622..e67c519 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -110,7 +110,6 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
#define DMF_FREEING 3
#define DMF_DELETING 4
#define DMF_NOFLUSH_SUSPENDING 5
-#define DMF_QUEUE_IO_TO_THREAD 6
/*
* Work processed by per-device workqueue.
@@ -144,11 +143,6 @@ struct mapped_device {
spinlock_t deferred_lock;
/*
- * An error from the flush request currently being processed.
- */
- int flush_error;
-
- /*
* Protect barrier_error from concurrent endio processing
* in request-based dm.
*/
@@ -529,16 +523,10 @@ static void end_io_acct(struct dm_io *io)
*/
static void queue_io(struct mapped_device *md, struct bio *bio)
{
- down_write(&md->io_lock);
-
spin_lock_irq(&md->deferred_lock);
bio_list_add(&md->deferred, bio);
spin_unlock_irq(&md->deferred_lock);
-
- if (!test_and_set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))
- queue_work(md->wq, &md->work);
-
- up_write(&md->io_lock);
+ queue_work(md->wq, &md->work);
}
/*
@@ -626,11 +614,9 @@ static void dec_pending(struct dm_io *io, int error)
* Target requested pushing back the I/O.
*/
spin_lock_irqsave(&md->deferred_lock, flags);
- if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_FLUSH))
- bio_list_add_head(&md->deferred,
- io->bio);
- } else
+ if (__noflush_suspending(md))
+ bio_list_add_head(&md->deferred, io->bio);
+ else
/* noflush suspend was interrupted. */
io->error = -EIO;
spin_unlock_irqrestore(&md->deferred_lock, flags);
@@ -638,26 +624,22 @@ static void dec_pending(struct dm_io *io, int error)
io_error = io->error;
bio = io->bio;
+ end_io_acct(io);
+ free_io(md, io);
+
+ if (io_error == DM_ENDIO_REQUEUE)
+ return;
- if (bio->bi_rw & REQ_FLUSH) {
+ if (!(bio->bi_rw & REQ_FLUSH) || !bio->bi_size) {
+ trace_block_bio_complete(md->queue, bio);
+ bio_endio(bio, io_error);
+ } else {
/*
- * There can be just one flush request so we use
- * a per-device variable for error reporting.
- * Note that you can't touch the bio after end_io_acct
+ * Preflush done for flush with data, reissue
+ * without REQ_FLUSH.
*/
- if (!md->flush_error)
- md->flush_error = io_error;
- end_io_acct(io);
- free_io(md, io);
- } else {
- end_io_acct(io);
- free_io(md, io);
-
- if (io_error != DM_ENDIO_REQUEUE) {
- trace_block_bio_complete(md->queue, bio);
-
- bio_endio(bio, io_error);
- }
+ bio->bi_rw &= ~REQ_FLUSH;
+ queue_io(md, bio);
}
}
}
@@ -1369,21 +1351,17 @@ static int __clone_and_map(struct clone_info *ci)
*/
static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
{
+ bool is_flush = bio->bi_rw & REQ_FLUSH;
struct clone_info ci;
int error = 0;
ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_FLUSH))
- bio_io_error(bio);
- else
- if (!md->flush_error)
- md->flush_error = -EIO;
+ bio_io_error(bio);
return;
}
ci.md = md;
- ci.bio = bio;
ci.io = alloc_io(md);
ci.io->error = 0;
atomic_set(&ci.io->io_count, 1);
@@ -1391,18 +1369,19 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
ci.io->md = md;
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
- if (!(bio->bi_rw & REQ_FLUSH))
+ ci.idx = bio->bi_idx;
+
+ if (!is_flush) {
+ ci.bio = bio;
ci.sector_count = bio_sectors(bio);
- else {
- /* all FLUSH bio's reaching here should be empty */
- WARN_ON_ONCE(bio_has_data(bio));
+ } else {
+ ci.bio = &ci.md->flush_bio;
ci.sector_count = 1;
}
- ci.idx = bio->bi_idx;
start_io_acct(ci.io);
while (ci.sector_count && !error) {
- if (!(bio->bi_rw & REQ_FLUSH))
+ if (!is_flush)
error = __clone_and_map(&ci);
else
error = __clone_and_map_flush(&ci);
@@ -1490,22 +1469,14 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
part_stat_add(cpu, &dm_disk(md)->part0, sectors[rw], bio_sectors(bio));
part_stat_unlock();
- /*
- * If we're suspended or the thread is processing flushes
- * we have to queue this io for later.
- */
- if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- (bio->bi_rw & REQ_FLUSH)) {
+ /* if we're suspended, we have to queue this io for later */
+ if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
up_read(&md->io_lock);
- if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
- bio_rw(bio) == READA) {
+ if (bio_rw(bio) != READA)
+ queue_io(md, bio);
+ else
bio_io_error(bio);
- return 0;
- }
-
- queue_io(md, bio);
-
return 0;
}
@@ -2015,6 +1986,10 @@ static struct mapped_device *alloc_dev(int minor)
if (!md->bdev)
goto bad_bdev;
+ bio_init(&md->flush_bio);
+ md->flush_bio.bi_bdev = md->bdev;
+ md->flush_bio.bi_rw = WRITE_FLUSH;
+
/* Populate the mapping, nobody knows we exist yet */
spin_lock(&_minor_lock);
old_md = idr_replace(&_minor_idr, md, minor);
@@ -2407,37 +2382,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
return r;
}
-static void process_flush(struct mapped_device *md, struct bio *bio)
-{
- md->flush_error = 0;
-
- /* handle REQ_FLUSH */
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
- bio_init(&md->flush_bio);
- md->flush_bio.bi_bdev = md->bdev;
- md->flush_bio.bi_rw = WRITE_FLUSH;
- __split_and_process_bio(md, &md->flush_bio);
-
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
- /* if it's an empty flush or the preflush failed, we're done */
- if (!bio_has_data(bio) || md->flush_error) {
- if (md->flush_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->flush_error);
- else {
- spin_lock_irq(&md->deferred_lock);
- bio_list_add_head(&md->deferred, bio);
- spin_unlock_irq(&md->deferred_lock);
- }
- return;
- }
-
- /* issue data + REQ_FUA */
- bio->bi_rw &= ~REQ_FLUSH;
- __split_and_process_bio(md, bio);
-}
-
/*
* Process the deferred bios
*/
@@ -2447,33 +2391,27 @@ static void dm_wq_work(struct work_struct *work)
work);
struct bio *c;
- down_write(&md->io_lock);
+ down_read(&md->io_lock);
while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
spin_lock_irq(&md->deferred_lock);
c = bio_list_pop(&md->deferred);
spin_unlock_irq(&md->deferred_lock);
- if (!c) {
- clear_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
+ if (!c)
break;
- }
- up_write(&md->io_lock);
+ up_read(&md->io_lock);
if (dm_request_based(md))
generic_make_request(c);
- else {
- if (c->bi_rw & REQ_FLUSH)
- process_flush(md, c);
- else
- __split_and_process_bio(md, c);
- }
+ else
+ __split_and_process_bio(md, c);
- down_write(&md->io_lock);
+ down_read(&md->io_lock);
}
- up_write(&md->io_lock);
+ up_read(&md->io_lock);
}
static void dm_queue_flush(struct mapped_device *md)
@@ -2672,17 +2610,12 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
*
* To get all processes out of __split_and_process_bio in dm_request,
* we take the write lock. To prevent any process from reentering
- * __split_and_process_bio from dm_request, we set
- * DMF_QUEUE_IO_TO_THREAD.
- *
- * To quiesce the thread (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND
- * and call flush_workqueue(md->wq). flush_workqueue will wait until
- * dm_wq_work exits and DMF_BLOCK_IO_FOR_SUSPEND will prevent any
- * further calls to __split_and_process_bio from dm_wq_work.
+ * __split_and_process_bio from dm_request and quiesce the thread
+ * (dm_wq_work), we set BMF_BLOCK_IO_FOR_SUSPEND and call
+ * flush_workqueue(md->wq).
*/
down_write(&md->io_lock);
set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
- set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
up_write(&md->io_lock);
/*
--
1.7.1
This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
the now-deprecated REQ_HARDBARRIER.
* -EOPNOTSUPP handling logic dropped.
* Preflush is handled as before but postflush is dropped and replaced
with passing down REQ_FUA to member request_queues. This replaces
one array-wide cache flush w/ member-specific FUA writes.
* __split_and_process_bio() now calls __clone_and_map_flush() directly
for flushes and guarantees all FLUSH bio's going to targets are zero
length.
* It's now guaranteed that all FLUSH bio's which are passed onto dm
targets are zero length. bio_empty_barrier() tests are replaced
with REQ_FLUSH tests.
* Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.
* Dropped unlikely() around REQ_FLUSH tests. Flushes are not unlikely
enough to be marked with unlikely().
* The block layer now filters out REQ_FLUSH/FUA bio's if the
request_queue doesn't support cache flushing. Advertise REQ_FLUSH |
REQ_FUA capability.
* Request-based dm isn't converted yet. dm_init_request_based_queue()
resets flush support to 0 for now. To avoid disturbing request-based
dm code, md->flush_error is added for bio-based dm while request-based
dm continues to use md->barrier_error.
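The flush sequencing for bio-based dm then becomes, in rough sketch
form (condensed from the process_flush() hunk in the dm.c diff below;
illustrative only, not the literal code):

	/* handle REQ_FLUSH: quiesce, then issue an empty preflush
	 * cloned to every target */
	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
	__split_and_process_bio(md, &md->flush_bio);
	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

	/* if it's an empty flush or the preflush failed, we're done;
	 * otherwise issue the data portion, which keeps REQ_FUA so
	 * no separate postflush is needed */
	if (bio_has_data(bio) && !md->flush_error) {
		bio->bi_rw &= ~REQ_FLUSH;
		__split_and_process_bio(md, bio);
	}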
Lightly tested linear, stripe, raid1, snap and crypt targets. Please
proceed with caution as I'm not familiar with the code base.
Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
---
drivers/md/dm-crypt.c | 2 +-
drivers/md/dm-io.c | 20 +-----
drivers/md/dm-log.c | 2 +-
drivers/md/dm-raid1.c | 8 +-
drivers/md/dm-region-hash.c | 16 +++---
drivers/md/dm-snap-persistent.c | 2 +-
drivers/md/dm-snap.c | 6 +-
drivers/md/dm-stripe.c | 2 +-
drivers/md/dm.c | 119 +++++++++++++++++++--------------------
9 files changed, 80 insertions(+), 97 deletions(-)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 368e8e9..d5b0e4c 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1278,7 +1278,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio,
struct dm_crypt_io *io;
struct crypt_config *cc;
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
cc = ti->private;
bio->bi_bdev = cc->dev->bdev;
return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 0590c75..136d4f7 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -31,7 +31,6 @@ struct dm_io_client {
*/
struct io {
unsigned long error_bits;
- unsigned long eopnotsupp_bits;
atomic_t count;
struct task_struct *sleeper;
struct dm_io_client *client;
@@ -130,11 +129,8 @@ static void retrieve_io_and_region_from_bio(struct bio *bio, struct io **io,
*---------------------------------------------------------------*/
static void dec_count(struct io *io, unsigned int region, int error)
{
- if (error) {
+ if (error)
set_bit(region, &io->error_bits);
- if (error == -EOPNOTSUPP)
- set_bit(region, &io->eopnotsupp_bits);
- }
if (atomic_dec_and_test(&io->count)) {
if (io->sleeper)
@@ -310,8 +306,8 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
sector_t remaining = where->count;
/*
- * where->count may be zero if rw holds a write barrier and we
- * need to send a zero-sized barrier.
+ * where->count may be zero if rw holds a flush and we need to
+ * send a zero-sized flush.
*/
do {
/*
@@ -364,7 +360,7 @@ static void dispatch_io(int rw, unsigned int num_regions,
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}
@@ -393,9 +389,7 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions,
return -EIO;
}
-retry:
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = current;
io->client = client;
@@ -412,11 +406,6 @@ retry:
}
set_current_state(TASK_RUNNING);
- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
- goto retry;
- }
-
if (error_bits)
*error_bits = io->error_bits;
@@ -437,7 +426,6 @@ static int async_io(struct dm_io_client *client, unsigned int num_regions,
io = mempool_alloc(client->pool, GFP_NOIO);
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = NULL;
io->client = client;
diff --git a/drivers/md/dm-log.c b/drivers/md/dm-log.c
index 5a08be0..33420e6 100644
--- a/drivers/md/dm-log.c
+++ b/drivers/md/dm-log.c
@@ -300,7 +300,7 @@ static int flush_header(struct log_c *lc)
.count = 0,
};
- lc->io_req.bi_rw = WRITE_BARRIER;
+ lc->io_req.bi_rw = WRITE_FLUSH;
return dm_io(&lc->io_req, 1, &null_location, NULL);
}
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index 7c081bc..19a59b0 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@ static int mirror_flush(struct dm_target *ti)
struct dm_io_region io[ms->nr_mirrors];
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE_BARRIER,
+ .bi_rw = WRITE_FLUSH,
.mem.type = DM_IO_KMEM,
.mem.ptr.bvec = NULL,
.client = ms->io_client,
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *ms, struct bio *bio)
struct dm_io_region io[ms->nr_mirrors], *dest = io;
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+ .bi_rw = WRITE | (bio->bi_rw & WRITE_FLUSH_FUA),
.mem.type = DM_IO_BVEC,
.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes)
bio_list_init(&requeue);
while ((bio = bio_list_pop(writes))) {
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
bio_list_add(&sync, bio);
continue;
}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (likely(!bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
index bd5c58b..dad011a 100644
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -81,9 +81,9 @@ struct dm_region_hash {
struct list_head failed_recovered_regions;
/*
- * If there was a barrier failure no regions can be marked clean.
+ * If there was a flush failure no regions can be marked clean.
*/
- int barrier_failure;
+ int flush_failure;
void *context;
sector_t target_begin;
@@ -217,7 +217,7 @@ struct dm_region_hash *dm_region_hash_create(
INIT_LIST_HEAD(&rh->quiesced_regions);
INIT_LIST_HEAD(&rh->recovered_regions);
INIT_LIST_HEAD(&rh->failed_recovered_regions);
- rh->barrier_failure = 0;
+ rh->flush_failure = 0;
rh->region_pool = mempool_create_kmalloc_pool(MIN_REGIONS,
sizeof(struct dm_region));
@@ -399,8 +399,8 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio)
region_t region = dm_rh_bio_to_region(rh, bio);
int recovering = 0;
- if (bio_empty_barrier(bio)) {
- rh->barrier_failure = 1;
+ if (bio->bi_rw & REQ_FLUSH) {
+ rh->flush_failure = 1;
return;
}
@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
struct bio *bio;
for (bio = bios->head; bio; bio = bio->bi_next) {
- if (bio_empty_barrier(bio))
+ if (bio->bi_rw & REQ_FLUSH)
continue;
rh_inc(rh, dm_rh_bio_to_region(rh, bio));
}
@@ -555,9 +555,9 @@ void dm_rh_dec(struct dm_region_hash *rh, region_t region)
*/
/* do nothing for DM_RH_NOSYNC */
- if (unlikely(rh->barrier_failure)) {
+ if (unlikely(rh->flush_failure)) {
/*
- * If a write barrier failed some time ago, we
+ * If a write flush failed some time ago, we
* don't know whether or not this write made it
* to the disk, so we must resync the device.
*/
diff --git a/drivers/md/dm-snap-persistent.c b/drivers/md/dm-snap-persistent.c
index cc2bdb8..0b61792 100644
--- a/drivers/md/dm-snap-persistent.c
+++ b/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@ static void persistent_commit_exception(struct dm_exception_store *store,
/*
* Commit exceptions to disk.
*/
- if (ps->valid && area_io(ps, WRITE_BARRIER))
+ if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
ps->valid = 0;
/*
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 5974d30..eed2101 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1587,7 +1587,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
chunk_t chunk;
struct dm_snap_pending_exception *pe = NULL;
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
bio->bi_bdev = s->cow->bdev;
return DM_MAPIO_REMAPPED;
}
@@ -1691,7 +1691,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio,
int r = DM_MAPIO_REMAPPED;
chunk_t chunk;
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
if (!map_context->target_request_nr)
bio->bi_bdev = s->origin->bdev;
else
@@ -2135,7 +2135,7 @@ static int origin_map(struct dm_target *ti, struct bio *bio,
struct dm_dev *dev = ti->private;
bio->bi_bdev = dev->bdev;
- if (unlikely(bio_empty_barrier(bio)))
+ if (bio->bi_rw & REQ_FLUSH)
return DM_MAPIO_REMAPPED;
/* Only tell snapshots if this is a write */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index c297f6d..f0371b4 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -271,7 +271,7 @@ static int stripe_map(struct dm_target *ti, struct bio *bio,
uint32_t stripe;
unsigned target_request_nr;
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio->bi_rw & REQ_FLUSH) {
target_request_nr = map_context->target_request_nr;
BUG_ON(target_request_nr >= sc->stripes);
bio->bi_bdev = sc->stripe[target_request_nr].dev->bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b1d92be..32e6622 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -144,15 +144,16 @@ struct mapped_device {
spinlock_t deferred_lock;
/*
- * An error from the barrier request currently being processed.
+ * An error from the flush request currently being processed.
*/
- int barrier_error;
+ int flush_error;
/*
* Protect barrier_error from concurrent endio processing
* in request-based dm.
*/
spinlock_t barrier_error_lock;
+ int barrier_error;
/*
* Processing queue (flush/barriers)
@@ -200,8 +201,8 @@ struct mapped_device {
/* sysfs handle */
struct kobject kobj;
- /* zero-length barrier that will be cloned and submitted to targets */
- struct bio barrier_bio;
+ /* zero-length flush that will be cloned and submitted to targets */
+ struct bio flush_bio;
};
/*
@@ -512,7 +513,7 @@ static void end_io_acct(struct dm_io *io)
/*
* After this is decremented the bio must not be touched if it is
- * a barrier.
+ * a flush.
*/
dm_disk(md)->part0.in_flight[rw] = pending =
atomic_dec_return(&md->pending[rw]);
@@ -626,7 +627,7 @@ static void dec_pending(struct dm_io *io, int error)
*/
spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+ if (!(io->bio->bi_rw & REQ_FLUSH))
bio_list_add_head(&md->deferred,
io->bio);
} else
@@ -638,20 +639,14 @@ static void dec_pending(struct dm_io *io, int error)
io_error = io->error;
bio = io->bio;
- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & REQ_FLUSH) {
/*
- * There can be just one barrier request so we use
+ * There can be just one flush request so we use
* a per-device variable for error reporting.
* Note that you can't touch the bio after end_io_acct
- *
- * We ignore -EOPNOTSUPP for empty flush reported by
- * underlying devices. We assume that if the device
- * doesn't support empty barriers, it doesn't need
- * cache flushing commands.
*/
- if (!md->barrier_error &&
- !(bio_empty_barrier(bio) && io_error == -EOPNOTSUPP))
- md->barrier_error = io_error;
+ if (!md->flush_error)
+ md->flush_error = io_error;
end_io_acct(io);
free_io(md, io);
} else {
@@ -1119,7 +1114,7 @@ static void dm_bio_destructor(struct bio *bio)
}
/*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that just does part of a bvec.
*/
static struct bio *split_bvec(struct bio *bio, sector_t sector,
unsigned short idx, unsigned int offset,
@@ -1134,7 +1129,7 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
clone->bi_sector = sector;
clone->bi_bdev = bio->bi_bdev;
- clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+ clone->bi_rw = bio->bi_rw;
clone->bi_vcnt = 1;
clone->bi_size = to_bytes(len);
clone->bi_io_vec->bv_offset = offset;
@@ -1161,7 +1156,6 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,
clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_rw &= ~REQ_HARDBARRIER;
clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
@@ -1225,7 +1219,7 @@ static void __issue_target_requests(struct clone_info *ci, struct dm_target *ti,
__issue_target_request(ci, ti, request_nr, len);
}
-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
{
unsigned target_nr = 0;
struct dm_target *ti;
@@ -1289,9 +1283,6 @@ static int __clone_and_map(struct clone_info *ci)
sector_t len = 0, max;
struct dm_target_io *tio;
- if (unlikely(bio_empty_barrier(bio)))
- return __clone_and_map_empty_barrier(ci);
-
if (unlikely(bio->bi_rw & REQ_DISCARD))
return __clone_and_map_discard(ci);
@@ -1383,11 +1374,11 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_HARDBARRIER))
+ if (!(bio->bi_rw & REQ_FLUSH))
bio_io_error(bio);
else
- if (!md->barrier_error)
- md->barrier_error = -EIO;
+ if (!md->flush_error)
+ md->flush_error = -EIO;
return;
}
@@ -1400,14 +1391,22 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
ci.io->md = md;
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
- ci.sector_count = bio_sectors(bio);
- if (unlikely(bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH))
+ ci.sector_count = bio_sectors(bio);
+ else {
+ /* all FLUSH bio's reaching here should be empty */
+ WARN_ON_ONCE(bio_has_data(bio));
ci.sector_count = 1;
+ }
ci.idx = bio->bi_idx;
start_io_acct(ci.io);
- while (ci.sector_count && !error)
- error = __clone_and_map(&ci);
+ while (ci.sector_count && !error) {
+ if (!(bio->bi_rw & REQ_FLUSH))
+ error = __clone_and_map(&ci);
+ else
+ error = __clone_and_map_flush(&ci);
+ }
/* drop the extra reference count */
dec_pending(ci.io, error);
@@ -1492,11 +1491,11 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
part_stat_unlock();
/*
- * If we're suspended or the thread is processing barriers
+ * If we're suspended or the thread is processing flushes
* we have to queue this io for later.
*/
if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+ (bio->bi_rw & REQ_FLUSH)) {
up_read(&md->io_lock);
if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1940,6 +1939,7 @@ static void dm_init_md_queue(struct mapped_device *md)
blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
md->queue->unplug_fn = dm_unplug_all;
blk_queue_merge_bvec(md->queue, dm_merge_bvec);
+ blk_queue_flush(md->queue, REQ_FLUSH | REQ_FUA);
}
/*
@@ -2245,7 +2245,8 @@ static int dm_init_request_based_queue(struct mapped_device *md)
blk_queue_softirq_done(md->queue, dm_softirq_done);
blk_queue_prep_rq(md->queue, dm_prep_fn);
blk_queue_lld_busy(md->queue, dm_lld_busy);
- blk_queue_flush(md->queue, REQ_FLUSH);
+ /* no flush support for request based dm yet */
+ blk_queue_flush(md->queue, 0);
elv_register_queue(md->queue);
@@ -2406,41 +2407,35 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
return r;
}
-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
{
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
- bio_init(&md->barrier_bio);
- md->barrier_bio.bi_bdev = md->bdev;
- md->barrier_bio.bi_rw = WRITE_BARRIER;
- __split_and_process_bio(md, &md->barrier_bio);
+ md->flush_error = 0;
+ /* handle REQ_FLUSH */
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
- md->barrier_error = 0;
+ bio_init(&md->flush_bio);
+ md->flush_bio.bi_bdev = md->bdev;
+ md->flush_bio.bi_rw = WRITE_FLUSH;
+ __split_and_process_bio(md, &md->flush_bio);
- dm_flush(md);
+ dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
- if (!bio_empty_barrier(bio)) {
- __split_and_process_bio(md, bio);
- /*
- * If the request isn't supported, don't waste time with
- * the second flush.
- */
- if (md->barrier_error != -EOPNOTSUPP)
- dm_flush(md);
+ /* if it's an empty flush or the preflush failed, we're done */
+ if (!bio_has_data(bio) || md->flush_error) {
+ if (md->flush_error != DM_ENDIO_REQUEUE)
+ bio_endio(bio, md->flush_error);
+ else {
+ spin_lock_irq(&md->deferred_lock);
+ bio_list_add_head(&md->deferred, bio);
+ spin_unlock_irq(&md->deferred_lock);
+ }
+ return;
}
- if (md->barrier_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->barrier_error);
- else {
- spin_lock_irq(&md->deferred_lock);
- bio_list_add_head(&md->deferred, bio);
- spin_unlock_irq(&md->deferred_lock);
- }
+ /* issue data + REQ_FUA */
+ bio->bi_rw &= ~REQ_FLUSH;
+ __split_and_process_bio(md, bio);
}
/*
@@ -2469,8 +2464,8 @@ static void dm_wq_work(struct work_struct *work)
if (dm_request_based(md))
generic_make_request(c);
else {
- if (c->bi_rw & REQ_HARDBARRIER)
- process_barrier(md, c);
+ if (c->bi_rw & REQ_FLUSH)
+ process_flush(md, c);
else
__split_and_process_bio(md, c);
}
--
1.7.1
On Mon, Aug 30 2010 at 5:58am -0400,
Tejun Heo <[email protected]> wrote:
> This patch converts request-based dm to support the new REQ_FLUSH/FUA.
...
> This patch rips out special flush code path and deals handles
> REQ_FLUSH/FUA requests the same way as other requests. The only
> special treatment is that REQ_FLUSH requests use the block address 0
> when finding target, which is enough for now.
Looks very comparable to the patch I prepared but I have 2 observations
below (based on my findings from testing my patch).
> @@ -1562,22 +1483,15 @@ static int setup_clone(struct request *clone, struct request *rq,
> {
> int r;
>
> - if (dm_rq_is_flush_request(rq)) {
> - blk_rq_init(NULL, clone);
> - clone->cmd_type = REQ_TYPE_FS;
> - clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
> - } else {
> - r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> - dm_rq_bio_constructor, tio);
> - if (r)
> - return r;
> -
> - clone->cmd = rq->cmd;
> - clone->cmd_len = rq->cmd_len;
> - clone->sense = rq->sense;
> - clone->buffer = rq->buffer;
> - }
> + r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> + dm_rq_bio_constructor, tio);
> + if (r)
> + return r;
>
> + clone->cmd = rq->cmd;
> + clone->cmd_len = rq->cmd_len;
> + clone->sense = rq->sense;
> + clone->buffer = rq->buffer;
> clone->end_io = end_clone_request;
> clone->end_io_data = tio;
blk_rq_prep_clone() of a REQ_FLUSH request will result in a
rq_data_dir(clone) of read.
I still had the following:
if (rq->cmd_flags & REQ_FLUSH) {
blk_rq_init(NULL, clone);
clone->cmd_type = REQ_TYPE_FS;
/* without this the clone has a rq_data_dir of 0 */
clone->cmd_flags |= WRITE_FLUSH;
} else {
r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
dm_rq_bio_constructor, tio);
...
Request-based DM's REQ_FLUSH still works without this special casing but
I figured I'd raise this to ask: what is the proper rq_data_dir() for
a REQ_FLUSH?
> @@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
> if (!rq)
> goto plug_and_out;
>
> - if (unlikely(dm_rq_is_flush_request(rq))) {
> - BUG_ON(md->flush_request);
> - md->flush_request = rq;
> - blk_start_request(rq);
> - queue_work(md->wq, &md->barrier_work);
> - goto out;
> - }
> + /* always use block 0 to find the target for flushes for now */
> + pos = 0;
> + if (!(rq->cmd_flags & REQ_FLUSH))
> + pos = blk_rq_pos(rq);
>
> - ti = dm_table_find_target(map, blk_rq_pos(rq));
> + ti = dm_table_find_target(map, pos);
I added the following here: BUG_ON(!dm_target_is_valid(ti));
> if (ti->type->busy && ti->type->busy(ti))
> goto plug_and_out;
I also needed to avoid the ->busy call for REQ_FLUSH:
if (!(rq->cmd_flags & REQ_FLUSH)) {
ti = dm_table_find_target(map, blk_rq_pos(rq));
BUG_ON(!dm_target_is_valid(ti));
if (ti->type->busy && ti->type->busy(ti))
goto plug_and_out;
} else {
/* rq-based only ever has one target! leverage this for FLUSH */
ti = dm_table_get_target(map, 0);
}
If I allowed ->busy to be called for REQ_FLUSH it would result in a
deadlock. I haven't identified where/why yet.
Other than these remaining issues this patch looks good.
Thanks,
Mike
Hello,
On 08/30/2010 03:28 PM, Mike Snitzer wrote:
>> + clone->cmd = rq->cmd;
>> + clone->cmd_len = rq->cmd_len;
>> + clone->sense = rq->sense;
>> + clone->buffer = rq->buffer;
>> clone->end_io = end_clone_request;
>> clone->end_io_data = tio;
>
> blk_rq_prep_clone() of a REQ_FLUSH request will result in a
> rq_data_dir(clone) of read.
Hmmm... why? blk_rq_prep_clone() copies all REQ_* flags in
REQ_CLONE_MASK and REQ_WRITE is definitely there. I'll check.
> I still had the following:
>
> if (rq->cmd_flags & REQ_FLUSH) {
> blk_rq_init(NULL, clone);
> clone->cmd_type = REQ_TYPE_FS;
> /* without this the clone has a rq_data_dir of 0 */
> clone->cmd_flags |= WRITE_FLUSH;
> } else {
> r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> dm_rq_bio_constructor, tio);
> ...
>
> Request-based DM's REQ_FLUSH still works without this special casing but
> I figured I'd raise this to ask: what is the proper rq_data_dir() is for
> a REQ_FLUSH?
Technically the block layer doesn't care one way or the other, but it
should definitely be WRITE. Maybe it would be a good idea to enforce
that from the block layer.
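(For reference, a minimal sketch of why a missing REQ_WRITE makes the
clone look like a read -- assuming the usual definitions of this era,
where rq_data_dir() is derived from the REQ_WRITE bit in cmd_flags:

	#define rq_data_dir(rq)	((rq)->cmd_flags & 1)	/* REQ_WRITE is bit 0 */

so a request whose cmd_flags contain only REQ_FLUSH reports
rq_data_dir() == READ.)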
>> @@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
>> if (!rq)
>> goto plug_and_out;
>>
>> - if (unlikely(dm_rq_is_flush_request(rq))) {
>> - BUG_ON(md->flush_request);
>> - md->flush_request = rq;
>> - blk_start_request(rq);
>> - queue_work(md->wq, &md->barrier_work);
>> - goto out;
>> - }
>> + /* always use block 0 to find the target for flushes for now */
>> + pos = 0;
>> + if (!(rq->cmd_flags & REQ_FLUSH))
>> + pos = blk_rq_pos(rq);
>>
>> - ti = dm_table_find_target(map, blk_rq_pos(rq));
>> + ti = dm_table_find_target(map, pos);
>
> I added the following here: BUG_ON(!dm_target_is_valid(ti));
I'll add it.
>> if (ti->type->busy && ti->type->busy(ti))
>> goto plug_and_out;
>
> I also needed to avoid the ->busy call for REQ_FLUSH:
>
> if (!(rq->cmd_flags & REQ_FLUSH)) {
> ti = dm_table_find_target(map, blk_rq_pos(rq));
> BUG_ON(!dm_target_is_valid(ti));
> if (ti->type->busy && ti->type->busy(ti))
> goto plug_and_out;
> } else {
> /* rq-based only ever has one target! leverage this for FLUSH */
> ti = dm_table_get_target(map, 0);
> }
>
> If I allowed ->busy to be called for REQ_FLUSH it would result in a
> deadlock. I haven't identified where/why yet.
Ah... that's probably from the "if (!elv_queue_empty(q))" check below.
Flushes are on a separate queue but I forgot to update
elv_queue_empty() to check the flush queue. elv_queue_empty() can
return %true spuriously, in which case the queue won't be plugged and
restarted later, leading to a queue hang. I'll fix elv_queue_empty().
Thanks.
--
tejun
On 08/30/2010 03:59 PM, Tejun Heo wrote:
> Ah... that's probably from the "if (!elv_queue_empty(q))" check below.
> Flushes are on a separate queue but I forgot to update
> elv_queue_empty() to check the flush queue. elv_queue_empty() can
> return %true spuriously, in which case the queue won't be plugged and
> restarted later, leading to a queue hang. I'll fix elv_queue_empty().
I think I was too quick to blame elv_queue_empty(). Can you please
test whether the following patch fixes the hang?
Thanks.
---
block/blk-flush.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -28,7 +28,8 @@ unsigned blk_flush_cur_seq(struct reques
}
static struct request *blk_flush_complete_seq(struct request_queue *q,
- unsigned seq, int error)
+ unsigned seq, int error,
+ bool from_end_io)
{
struct request *next_rq = NULL;
@@ -51,6 +52,13 @@ static struct request *blk_flush_complet
if (!list_empty(&q->pending_flushes)) {
next_rq = list_entry_rq(q->pending_flushes.next);
list_move(&next_rq->queuelist, &q->queue_head);
+ /*
+ * Moving a request silently to queue_head may
+ * stall the queue, kick the queue if we
+ * aren't in the issue path already.
+ */
+ if (from_end_io)
+ __blk_run_queue(q);
}
}
return next_rq;
@@ -59,19 +67,19 @@ static struct request *blk_flush_complet
static void pre_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error, true);
}
static void flush_data_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error, true);
}
static void post_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error, true);
}
static void init_flush_request(struct request *rq, struct gendisk *disk)
@@ -165,7 +173,7 @@ struct request *blk_do_flush(struct requ
skip |= QUEUE_FSEQ_DATA;
if (!do_postflush)
skip |= QUEUE_FSEQ_POSTFLUSH;
- return blk_flush_complete_seq(q, skip, 0);
+ return blk_flush_complete_seq(q, skip, 0, false);
}
static void bio_end_flush(struct bio *bio, int err)
init_flush_request() set only REQ_FLUSH when initializing flush
requests, making them READ requests. Use WRITE_FLUSH instead.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Mike Snitzer <[email protected]>
---
So, this was the culprit for the incorrect data direction for flush
requests.
Thanks.
block/blk-flush.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -77,7 +77,7 @@ static void post_flush_end_io(struct req
static void init_flush_request(struct request *rq, struct gendisk *disk)
{
rq->cmd_type = REQ_TYPE_FS;
- rq->cmd_flags = REQ_FLUSH;
+ rq->cmd_flags = WRITE_FLUSH;
rq->rq_disk = disk;
}
This patch converts request-based dm to support the new REQ_FLUSH/FUA.
The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.
In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special mostly independent path which can issue
flushes to multiple targets and sequence them. However, the
capability isn't currently in use and adds a lot of complexity.
Moreover, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.
This patch rips out the special flush code path and handles
REQ_FLUSH/FUA requests the same way as other requests. The only
special treatment is that REQ_FLUSH requests use block address 0
when finding the target, which is enough for now.
* added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
suggested by Mike Snitzer
Signed-off-by: Tejun Heo <[email protected]>
Cc: Mike Snitzer <[email protected]>
---
Here's a version w/ BUG_ON() added. Once the queue hang issue is
tracked down, I'll refresh the whole series and repost.
Thanks.
drivers/md/dm.c | 206 +++++---------------------------------------------------
1 file changed, 22 insertions(+), 184 deletions(-)
Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
spinlock_t deferred_lock;
/*
- * Protect barrier_error from concurrent endio processing
- * in request-based dm.
- */
- spinlock_t barrier_error_lock;
- int barrier_error;
-
- /*
- * Processing queue (flush/barriers)
+ * Processing queue (flush)
*/
struct workqueue_struct *wq;
- struct work_struct barrier_work;
-
- /* A pointer to the currently processing pre/post flush request */
- struct request *flush_request;
/*
* The current mapping.
@@ -732,23 +721,6 @@ static void end_clone_bio(struct bio *cl
blk_update_request(tio->orig, 0, nr_bytes);
}
-static void store_barrier_error(struct mapped_device *md, int error)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&md->barrier_error_lock, flags);
- /*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
- */
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
/*
* Don't touch any member of the md after calling this function because
* the md may be freed in dm_put() at the end of this function.
@@ -786,13 +758,11 @@ static void free_rq_clone(struct request
static void dm_end_request(struct request *clone, int error)
{
int rw = rq_data_dir(clone);
- int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;
- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;
@@ -806,15 +776,8 @@ static void dm_end_request(struct reques
}
free_rq_clone(clone);
-
- if (unlikely(is_barrier)) {
- if (unlikely(error))
- store_barrier_error(md, error);
- run_queue = 0;
- } else
- blk_end_request_all(rq, error);
-
- rq_completed(md, rw, run_queue);
+ blk_end_request_all(rq, error);
+ rq_completed(md, rw, true);
}
static void dm_unprep_request(struct request *rq)
@@ -839,16 +802,6 @@ void dm_requeue_unmapped_request(struct
struct request_queue *q = rq->q;
unsigned long flags;
- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
- /*
- * Barrier clones share an original request.
- * Leave it to dm_end_request(), which handles this special
- * case.
- */
- dm_end_request(clone, DM_ENDIO_REQUEUE);
- return;
- }
-
dm_unprep_request(rq);
spin_lock_irqsave(q->queue_lock, flags);
@@ -938,19 +891,6 @@ static void dm_complete_request(struct r
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;
- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
- /*
- * Barrier clones share an original request. So can't use
- * softirq_done with the original.
- * Pass the clone to dm_done() directly in this special case.
- * It is safe (even if clone->q->queue_lock is held here)
- * because there is no I/O dispatching during the completion
- * of barrier clone.
- */
- dm_done(clone, error, true);
- return;
- }
-
tio->error = error;
rq->completion_data = clone;
blk_complete_request(rq);
@@ -967,17 +907,6 @@ void dm_kill_unmapped_request(struct req
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;
- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
- /*
- * Barrier clones share an original request.
- * Leave it to dm_end_request(), which handles this special
- * case.
- */
- BUG_ON(error > 0);
- dm_end_request(clone, error);
- return;
- }
-
rq->cmd_flags |= REQ_FAILED;
dm_complete_request(clone, error);
}
@@ -1507,14 +1436,6 @@ static int dm_request(struct request_que
return _dm_request(q, bio);
}
-static bool dm_rq_is_flush_request(struct request *rq)
-{
- if (rq->cmd_flags & REQ_FLUSH)
- return true;
- else
- return false;
-}
-
void dm_dispatch_request(struct request *rq)
{
int r;
@@ -1562,22 +1483,15 @@ static int setup_clone(struct request *c
{
int r;
- if (dm_rq_is_flush_request(rq)) {
- blk_rq_init(NULL, clone);
- clone->cmd_type = REQ_TYPE_FS;
- clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
- } else {
- r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
- dm_rq_bio_constructor, tio);
- if (r)
- return r;
-
- clone->cmd = rq->cmd;
- clone->cmd_len = rq->cmd_len;
- clone->sense = rq->sense;
- clone->buffer = rq->buffer;
- }
+ r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
+ dm_rq_bio_constructor, tio);
+ if (r)
+ return r;
+ clone->cmd = rq->cmd;
+ clone->cmd_len = rq->cmd_len;
+ clone->sense = rq->sense;
+ clone->buffer = rq->buffer;
clone->end_io = end_clone_request;
clone->end_io_data = tio;
@@ -1618,9 +1532,6 @@ static int dm_prep_fn(struct request_que
struct mapped_device *md = q->queuedata;
struct request *clone;
- if (unlikely(dm_rq_is_flush_request(rq)))
- return BLKPREP_OK;
-
if (unlikely(rq->special)) {
DMWARN("Already has something in rq->special.");
return BLKPREP_KILL;
@@ -1697,6 +1608,7 @@ static void dm_request_fn(struct request
struct dm_table *map = dm_get_live_table(md);
struct dm_target *ti;
struct request *rq, *clone;
+ sector_t pos;
/*
* For suspend, check blk_queue_stopped() and increment
@@ -1709,15 +1621,14 @@ static void dm_request_fn(struct request
if (!rq)
goto plug_and_out;
- if (unlikely(dm_rq_is_flush_request(rq))) {
- BUG_ON(md->flush_request);
- md->flush_request = rq;
- blk_start_request(rq);
- queue_work(md->wq, &md->barrier_work);
- goto out;
- }
+ /* always use block 0 to find the target for flushes for now */
+ pos = 0;
+ if (!(rq->cmd_flags & REQ_FLUSH))
+ pos = blk_rq_pos(rq);
+
+ ti = dm_table_find_target(map, pos);
+ BUG_ON(!dm_target_is_valid(ti));
- ti = dm_table_find_target(map, blk_rq_pos(rq));
if (ti->type->busy && ti->type->busy(ti))
goto plug_and_out;
@@ -1888,7 +1799,6 @@ out:
static const struct block_device_operations dm_blk_dops;
static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
static void dm_init_md_queue(struct mapped_device *md)
{
@@ -1943,7 +1853,6 @@ static struct mapped_device *alloc_dev(i
mutex_init(&md->suspend_lock);
mutex_init(&md->type_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -1966,7 +1875,6 @@ static struct mapped_device *alloc_dev(i
atomic_set(&md->pending[1], 0);
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
- INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
init_waitqueue_head(&md->eventq);
md->disk->major = _major;
@@ -2220,8 +2128,6 @@ static int dm_init_request_based_queue(s
blk_queue_softirq_done(md->queue, dm_softirq_done);
blk_queue_prep_rq(md->queue, dm_prep_fn);
blk_queue_lld_busy(md->queue, dm_lld_busy);
- /* no flush support for request based dm yet */
- blk_queue_flush(md->queue, 0);
elv_register_queue(md->queue);
@@ -2421,73 +2327,6 @@ static void dm_queue_flush(struct mapped
queue_work(md->wq, &md->work);
}
-static void dm_rq_set_target_request_nr(struct request *clone, unsigned request_nr)
-{
- struct dm_rq_target_io *tio = clone->end_io_data;
-
- tio->info.target_request_nr = request_nr;
-}
-
-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
-{
- int i, j;
- struct dm_table *map = dm_get_live_table(md);
- unsigned num_targets = dm_table_get_num_targets(map);
- struct dm_target *ti;
- struct request *clone;
-
- md->barrier_error = 0;
-
- for (i = 0; i < num_targets; i++) {
- ti = dm_table_get_target(map, i);
- for (j = 0; j < ti->num_flush_requests; j++) {
- clone = clone_rq(md->flush_request, md, GFP_NOIO);
- dm_rq_set_target_request_nr(clone, j);
- atomic_inc(&md->pending[rq_data_dir(clone)]);
- map_request(ti, clone, md);
- }
- }
-
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
- dm_table_put(map);
-
- return md->barrier_error;
-}
-
-static void dm_rq_barrier_work(struct work_struct *work)
-{
- int error;
- struct mapped_device *md = container_of(work, struct mapped_device,
- barrier_work);
- struct request_queue *q = md->queue;
- struct request *rq;
- unsigned long flags;
-
- /*
- * Hold the md reference here and leave it at the last part so that
- * the md can't be deleted by device opener when the barrier request
- * completes.
- */
- dm_get(md);
-
- error = dm_rq_barrier(md);
-
- rq = md->flush_request;
- md->flush_request = NULL;
-
- if (error == DM_ENDIO_REQUEUE) {
- spin_lock_irqsave(q->queue_lock, flags);
- blk_requeue_request(q, rq);
- spin_unlock_irqrestore(q->queue_lock, flags);
- } else
- blk_end_request_all(rq, error);
-
- blk_run_queue(q);
-
- dm_put(md);
-}
-
/*
* Swap in a new table, returning the old one for the caller to destroy.
*/
@@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
up_write(&md->io_lock);
/*
- * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
- * can be kicked until md->queue is stopped. So stop md->queue before
- * flushing md->wq.
+ * Stop md->queue before flushing md->wq in case request-based
+ * dm defers requests to md->wq from md->queue.
*/
if (dm_request_based(md))
stop_queue(md->queue);
On Mon, Aug 30 2010 at 11:07am -0400,
Tejun Heo <[email protected]> wrote:
> On 08/30/2010 03:59 PM, Tejun Heo wrote:
> > Ah... that's probably from the "if (!elv_queue_empty(q))" check below;
> > flushes are on a separate queue, but I forgot to update
> > elv_queue_empty() to check the flush queue. elv_queue_empty() can
> > return %true spuriously in which case the queue won't be plugged and
> > restarted later leading to queue hang. I'll fix elv_queue_empty().
>
> I think I was too quick to blame elv_queue_empty(). Can you please
> test whether the following patch fixes the hang?
It does, thanks!
Tested-by: Mike Snitzer <[email protected]>
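
A rough sketch of the failure mode described above (the helper name and
structure are illustrative assumptions, not actual block-layer code;
elv_queue_empty() and __blk_run_queue() are the real interfaces involved,
and the explicit kick mirrors the fix Tejun posts further down the thread):

#include <linux/blkdev.h>
#include <linux/elevator.h>

/*
 * Sketch only -- not actual 2.6.36 code.  The flush machinery moves the
 * next request of a flush sequence onto q->queue_head behind the
 * elevator's back.  If elv_queue_empty() does not account for that list,
 * the queue can look idle, nothing plugs or runs it again, and later
 * submitters block forever in get_request_wait().
 */
static void flush_seq_step_sketch(struct request_queue *q,
				  struct request *next_rq)
{
	bool was_empty = elv_queue_empty(q);	/* sampled before queueing */

	/* queue the next step of the flush sequence directly */
	list_add_tail(&next_rq->queuelist, &q->queue_head);

	/*
	 * Without an explicit kick the request above may never be
	 * dispatched when the elevator already looked empty; the follow-up
	 * patch later in this thread adds exactly this __blk_run_queue().
	 */
	if (was_empty)
		__blk_run_queue(q);
}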
On Mon, Aug 30 2010 at 11:45am -0400,
Tejun Heo <[email protected]> wrote:
> This patch converts request-based dm to support the new REQ_FLUSH/FUA.
>
> The original request-based flush implementation depended on
> request_queue blocking other requests while a barrier sequence is in
> progress, which is no longer true for the new REQ_FLUSH/FUA.
>
> In general, request-based dm doesn't have infrastructure for cloning
> one source request to multiple targets, but the original flush
> implementation had a special mostly independent path which can issue
> flushes to multiple targets and sequence them. However, the
> capability isn't currently in use and adds a lot of complexity.
> Moreover, it's unlikely to be useful in its current form as it
> doesn't make sense to be able to send out flushes to multiple targets
> when write requests can't be.
>
> This patch rips out the special flush code path and handles
> REQ_FLUSH/FUA requests the same way as other requests. The only
> special treatment is that REQ_FLUSH requests use block address 0
> when finding the target, which is enough for now.
>
> * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
> suggested by Mike Snitzer
>
> Signed-off-by: Tejun Heo <[email protected]>
Looks good.
Acked-by: Mike Snitzer <[email protected]>
Junichi and/or Kiyoshi,
Could you please review this patch and add your Acked-by if it is OK?
(Alasdair will want to see NEC's Ack to accept this patch).
Thanks,
Mike
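
For reference, the flush-routing change being acked can be condensed into a
small self-contained sketch (function names are taken from the dm_request_fn()
hunk in the patch above; this is a simplification, not a drop-in replacement):

#include <linux/blkdev.h>
#include "dm.h"	/* dm_table_find_target(), dm_target_is_valid() */

/*
 * Simplified from the dm_request_fn() hunk above: REQ_FLUSH requests
 * carry no meaningful sector, so block 0 is used to pick a target for
 * now; every other request is routed by its real position.
 */
static struct dm_target *choose_target_sketch(struct dm_table *map,
					      struct request *rq)
{
	sector_t pos = 0;

	if (!(rq->cmd_flags & REQ_FLUSH))
		pos = blk_rq_pos(rq);

	return dm_table_find_target(map, pos);
}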
On Mon, Aug 30 2010 at 3:08pm -0400,
Mike Snitzer <[email protected]> wrote:
> On Mon, Aug 30 2010 at 11:07am -0400,
> Tejun Heo <[email protected]> wrote:
>
> > On 08/30/2010 03:59 PM, Tejun Heo wrote:
> > > Ah... that's probably from the "if (!elv_queue_empty(q))" check below;
> > > flushes are on a separate queue, but I forgot to update
> > > elv_queue_empty() to check the flush queue. elv_queue_empty() can
> > > return %true spuriously in which case the queue won't be plugged and
> > > restarted later leading to queue hang. I'll fix elv_queue_empty().
> >
> > I think I was too quick to blame elv_queue_empty(). Can you please
> > test whether the following patch fixes the hang?
>
> It does, thanks!
Hmm, but unfortunately I was too quick to say the patch fixed the hang.
It is much more rare, but I can still get a hang. I just got the
following running vgcreate against a DM mpath (rq-based) device:
INFO: task vgcreate:3517 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgcreate D ffff88003d677a00 5168 3517 3361 0x00000080
ffff88003d677998 0000000000000046 ffff880000000000 ffff88003d677fd8
ffff880039c84860 ffff88003d677fd8 00000000001d3880 ffff880039c84c30
ffff880039c84c28 00000000001d3880 00000000001d3880 ffff88003d677fd8
Call Trace:
[<ffffffff81389308>] io_schedule+0x73/0xb5
[<ffffffff811c7304>] get_request_wait+0xef/0x17d
[<ffffffff810642be>] ? autoremove_wake_function+0x0/0x39
[<ffffffff811c7890>] __make_request+0x333/0x467
[<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
[<ffffffff811c5e91>] generic_make_request+0x342/0x3bf
[<ffffffff81074714>] ? trace_hardirqs_off+0xd/0xf
[<ffffffff81069df2>] ? local_clock+0x41/0x5a
[<ffffffff811c5fe9>] submit_bio+0xdb/0xf8
[<ffffffff810754a4>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff811381a6>] dio_bio_submit+0x7b/0x9c
[<ffffffff81138dbe>] __blockdev_direct_IO+0x7f3/0x97d
[<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
[<ffffffff81136d7a>] blkdev_direct_IO+0x57/0x59
[<ffffffff81135f58>] ? blkdev_get_blocks+0x0/0x90
[<ffffffff810ce301>] generic_file_aio_read+0xed/0x5b4
[<ffffffff81077932>] ? lock_release_non_nested+0xd5/0x23b
[<ffffffff810e40f8>] ? might_fault+0x5c/0xac
[<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
[<ffffffff8110e131>] do_sync_read+0xcb/0x108
[<ffffffff81074688>] ? trace_hardirqs_off_caller+0x1f/0x9e
[<ffffffff81389a99>] ? __mutex_unlock_slowpath+0x120/0x132
[<ffffffff8119d805>] ? fsnotify_perm+0x4a/0x50
[<ffffffff8119d86c>] ? security_file_permission+0x2e/0x33
[<ffffffff8110e7a3>] vfs_read+0xab/0x107
[<ffffffff81075473>] ? trace_hardirqs_on_caller+0x11d/0x141
[<ffffffff8110e8c2>] sys_read+0x4d/0x74
[<ffffffff81002c32>] system_call_fastpath+0x16/0x1b
no locks held by vgcreate/3517.
I was then able to reproduce it after reboot and another ~5 attempts
(all against 2.6.36-rc2 + your latest FLUSH+FUA patchset and DM
patches).
crash> bt -l 3893
PID: 3893 TASK: ffff88003e65a430 CPU: 0 COMMAND: "vgcreate"
#0 [ffff88003a5298d8] schedule at ffffffff813891d3
/root/git/linux-2.6/kernel/sched.c: 2873
#1 [ffff88003a5299a0] io_schedule at ffffffff81389308
/root/git/linux-2.6/kernel/sched.c: 5128
#2 [ffff88003a5299c0] get_request_wait at ffffffff811c7304
/root/git/linux-2.6/block/blk-core.c: 879
#3 [ffff88003a529a50] __make_request at ffffffff811c7890
/root/git/linux-2.6/block/blk-core.c: 1301
#4 [ffff88003a529ac0] generic_make_request at ffffffff811c5e91
/root/git/linux-2.6/block/blk-core.c: 1536
#5 [ffff88003a529b70] submit_bio at ffffffff811c5fe9
/root/git/linux-2.6/block/blk-core.c: 1632
#6 [ffff88003a529bc0] dio_bio_submit at ffffffff811381a6
/root/git/linux-2.6/fs/direct-io.c: 375
#7 [ffff88003a529bf0] __blockdev_direct_IO at ffffffff81138dbe
/root/git/linux-2.6/fs/direct-io.c: 1087
#8 [ffff88003a529cd0] blkdev_direct_IO at ffffffff81136d7a
/root/git/linux-2.6/fs/block_dev.c: 177
#9 [ffff88003a529d10] generic_file_aio_read at ffffffff810ce301
/root/git/linux-2.6/mm/filemap.c: 1303
#10 [ffff88003a529df0] do_sync_read at ffffffff8110e131
/root/git/linux-2.6/fs/read_write.c: 282
#11 [ffff88003a529f00] vfs_read at ffffffff8110e7a3
/root/git/linux-2.6/fs/read_write.c: 310
#12 [ffff88003a529f40] sys_read at ffffffff8110e8c2
/root/git/linux-2.6/fs/read_write.c: 388
#13 [ffff88003a529f80] system_call_fastpath at ffffffff81002c32
/root/git/linux-2.6/arch/x86/kernel/entry_64.S: 488
RIP: 0000003b602d41a0 RSP: 00007fff55d5b928 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81002c32 RCX: 00007fff55d5b960
RDX: 0000000000001000 RSI: 00007fff55d5a000 RDI: 0000000000000005
RBP: 0000000000000000 R8: 0000000000494ecd R9: 0000000000001000
R10: 000000315c41c160 R11: 0000000000000246 R12: 00007fff55d5a000
R13: 00007fff55d5b0a0 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
On 08/30/2010 11:28 PM, Mike Snitzer wrote:
> On Mon, Aug 30 2010 at 3:08pm -0400,
> Mike Snitzer <[email protected]> wrote:
>
>> On Mon, Aug 30 2010 at 11:07am -0400,
>> Tejun Heo <[email protected]> wrote:
>>
>>> On 08/30/2010 03:59 PM, Tejun Heo wrote:
>>>> Ah... that's probably from the "if (!elv_queue_empty(q))" check below;
>>>> flushes are on a separate queue, but I forgot to update
>>>> elv_queue_empty() to check the flush queue. elv_queue_empty() can
>>>> return %true spuriously in which case the queue won't be plugged and
>>>> restarted later leading to queue hang. I'll fix elv_queue_empty().
>>>
>>> I think I was too quick to blame elv_queue_empty(). Can you please
>>> test whether the following patch fixes the hang?
>>
>> It does, thanks!
>
> Hmm, but unfortunately I was too quick to say the patch fixed the hang.
>
> It is much more rare, but I can still get a hang. I just got the
> following running vgcreate against a DM mpath (rq-based) device:
Can you please try this one instead?
Thanks.
---
block/blk-flush.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -56,22 +56,38 @@ static struct request *blk_flush_complet
return next_rq;
}
+static void blk_flush_complete_seq_end_io(struct request_queue *q,
+ unsigned seq, int error)
+{
+ bool was_empty = elv_queue_empty(q);
+ struct request *next_rq;
+
+ next_rq = blk_flush_complete_seq(q, seq, error);
+
+ /*
+ * Moving a request silently to empty queue_head may stall the
+ * queue. Kick the queue in those cases.
+ */
+ if (next_rq && was_empty)
+ __blk_run_queue(q);
+}
+
static void pre_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+ blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
}
static void flush_data_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
+ blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
}
static void post_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+ blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
}
static void init_flush_request(struct request *rq, struct gendisk *disk)
--
tejun
On Tue, Aug 31 2010 at 6:29am -0400,
Tejun Heo <[email protected]> wrote:
> On 08/30/2010 11:28 PM, Mike Snitzer wrote:
> > Hmm, but unfortunately I was too quick to say the patch fixed the hang.
> >
> > It is much more rare, but I can still get a hang. I just got the
> > following running vgcreate against a DM mpath (rq-based) device:
>
> Can you please try this one instead?
Still hit the hang on the 5th iteration of my test:
while true ; do ./test_dm_discard_mpath.sh && sleep 1 ; done
Would you like me to (re)send my test script offlist?
INFO: task vgcreate:2617 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgcreate D ffff88007bf7ba00 4688 2617 2479 0x00000080
ffff88007bf7b998 0000000000000046 ffff880000000000 ffff88007bf7bfd8
ffff88005542a430 ffff88007bf7bfd8 00000000001d3880 ffff88005542a800
ffff88005542a7f8 00000000001d3880 00000000001d3880 ffff88007bf7bfd8
Call Trace:
[<ffffffff81389338>] io_schedule+0x73/0xb5
[<ffffffff811c7304>] get_request_wait+0xef/0x17d
[<ffffffff810642be>] ? autoremove_wake_function+0x0/0x39
[<ffffffff811c7890>] __make_request+0x333/0x467
[<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
[<ffffffff811c5e91>] generic_make_request+0x342/0x3bf
[<ffffffff81074714>] ? trace_hardirqs_off+0xd/0xf
[<ffffffff81069df2>] ? local_clock+0x41/0x5a
[<ffffffff811c5fe9>] submit_bio+0xdb/0xf8
[<ffffffff810754a4>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff811381a6>] dio_bio_submit+0x7b/0x9c
[<ffffffff81138dbe>] __blockdev_direct_IO+0x7f3/0x97d
[<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
[<ffffffff81136d7a>] blkdev_direct_IO+0x57/0x59
[<ffffffff81135f58>] ? blkdev_get_blocks+0x0/0x90
[<ffffffff810ce301>] generic_file_aio_read+0xed/0x5b4
[<ffffffff81077932>] ? lock_release_non_nested+0xd5/0x23b
[<ffffffff810e40f8>] ? might_fault+0x5c/0xac
[<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
[<ffffffff8110e131>] do_sync_read+0xcb/0x108
[<ffffffff81074688>] ? trace_hardirqs_off_caller+0x1f/0x9e
[<ffffffff81389ac9>] ? __mutex_unlock_slowpath+0x120/0x132
[<ffffffff8119d805>] ? fsnotify_perm+0x4a/0x50
[<ffffffff8119d86c>] ? security_file_permission+0x2e/0x33
[<ffffffff8110e7a3>] vfs_read+0xab/0x107
[<ffffffff81075473>] ? trace_hardirqs_on_caller+0x11d/0x141
[<ffffffff8110e8c2>] sys_read+0x4d/0x74
[<ffffffff81002c32>] system_call_fastpath+0x16/0x1b
no locks held by vgcreate/2617.
Hello,
On 08/31/2010 03:02 PM, Mike Snitzer wrote:
> On Tue, Aug 31 2010 at 6:29am -0400,
> Tejun Heo <[email protected]> wrote:
>
>> On 08/30/2010 11:28 PM, Mike Snitzer wrote:
>>> Hmm, but unfortunately I was too quick to say the patch fixed the hang.
>>>
>>> It is much more rare, but I can still get a hang. I just got the
>>> following running vgcreate against a DM mpath (rq-based) device:
>>
>> Can you please try this one instead?
>
> Still hit the hang on the 5th iteration of my test:
> while true ; do ./test_dm_discard_mpath.sh && sleep 1 ; done
>
> Would you like me to (re)send my test script offlist?
Yes, please. Thanks.
--
tejun