Hello, Jens, James, Jeff and Bartlomiej.
This is the fifth posting of the blk ordered reimplementation. The
last posting was on Oct 19th.
http://marc.theaimsgroup.com/?l=linux-kernel&m=112972609907761&w=2
The generic dispatch queue patchset, which is required for this
ordered patchset, has made it into the main tree.
http://marc.theaimsgroup.com/?l=linux-kernel&m=112972555921965&w=2
Changes from the previous post are...
* As the core blk files have moved from drivers/block/ to block/,
the patches are updated accordingly.
* In 04_blk_scsi-update-ordered, sd_prepare_flush() is updated to
initialize rq->cmd_len properly. This used to work previously,
but something has broken it. Note that the original flush code is
broken too. James, can you please verify this?
* libata has gone through some changes since the last posting. The
libata updates have been rewritten. Jeff, please review #07 and #08.
Other than the above three, nothing has changed. Jens, after
review, can we push these into -mm?
Thanks. :-)
[ Start of patch descriptions ]
01_blk_add-uptodate-to-end_that_request_last.patch
: add @uptodate to end_that_request_last() and @error to rq_end_io_fn()
Add @uptodate argument to end_that_request_last() and @error
to rq_end_io_fn(). There's no generic way to pass an error code
to the request completion function, making generic error handling
of non-fs requests difficult (rq->errors is driver-specific and
each driver uses it differently). This patch adds @uptodate
to end_that_request_last() and @error to rq_end_io_fn().
For fs requests, this doesn't really matter, so just using the
same uptodate argument used in the last call to
end_that_request_first() should suffice. IMHO, this can also
help the generic command-carrying request Jens is working on.
02_blk_implement-init_request_from_bio.patch
: separate out bio init part from __make_request
Separate out bio initialization part from __make_request. It
will be used by the following blk_ordered_reimpl.
03_blk_reimplement-ordered.patch
: reimplement handling of barrier requests
Reimplement handling of barrier requests.
* Flexible handling to deal with various capabilities of
target devices.
* Retry support for falling back.
* Tagged queues which don't support ordered tags can still do
ordered writes (by draining the queue).
04_blk_scsi-update-ordered.patch
: update SCSI to use new blk_ordered
All ordered-request related stuff is delegated to the HLD. The
midlayer now doesn't deal with the ordered setting or the
prepare_flush callback. sd.c is updated to handle the
blk_queue_ordered setting. Currently, ordered tags aren't used
as the SCSI midlayer cannot guarantee request ordering.
05_blk_scsi-add-fua-support.patch
: add FUA support to SCSI disk
Add FUA support to SCSI disk.
06_blk_libata-update-ordered.patch
: update libata to use new blk_ordered
Reflect changes in the SCSI midlayer and update libata to use the
new ordered request implementation.
07_blk_libata-add-fua-support.patch
: add FUA support to libata
Add FUA support to libata.
08_blk_ide-update-ordered.patch
: update IDE to use new blk_ordered
Update IDE to use new blk_ordered.
09_blk_ide-add-fua-support.patch
: add FUA support to IDE
Add FUA support to IDE.
10_blk_add-barrier-doc.patch
: I/O barrier documentation
I/O barrier documentation
[ End of patch descriptions ]
--
tejun
01_blk_add-uptodate-to-end_that_request_last.patch
Add @uptodate argument to end_that_request_last() and @error
to rq_end_io_fn(). There's no generic way to pass an error code
to the request completion function, making generic error handling
of non-fs requests difficult (rq->errors is driver-specific and
each driver uses it differently). This patch adds @uptodate
to end_that_request_last() and @error to rq_end_io_fn().
For fs requests, this doesn't really matter, so just using the
same uptodate argument used in the last call to
end_that_request_first() should suffice. IMHO, this can also
help the generic command-carrying request Jens is working on.
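For illustration only (not part of the patch; the "mydrv" names are
made up), a hypothetical driver's completion path with the new
signatures looks like this. @uptodate keeps its usual meaning
(1 = success, 0 = -EIO, < 0 = direct error code), and rq->end_io
now receives the resulting error directly:

	/* rq_end_io_fn: second argument is 0 on success, -errno on failure */
	static void mydrv_rq_done(struct request *rq, int error)
	{
		struct mydrv_cmd *cmd = rq->end_io_data;

		cmd->error = error;	/* no more poking at rq->errors */
		complete(&cmd->done);
	}

	/* caller holds the queue lock, as before */
	static void mydrv_end_request(struct request *rq, int uptodate)
	{
		if (!end_that_request_first(rq, uptodate, rq->hard_nr_sectors)) {
			blkdev_dequeue_request(rq);
			/* @uptodate is turned into the error passed to rq->end_io */
			end_that_request_last(rq, uptodate);
		}
	}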
Signed-off-by: Tejun Heo <[email protected]>
block/elevator.c | 2 +-
block/ll_rw_blk.c | 22 +++++++++++++++-------
drivers/block/DAC960.c | 2 +-
drivers/block/cciss.c | 2 +-
drivers/block/cpqarray.c | 2 +-
drivers/block/floppy.c | 2 +-
drivers/block/nbd.c | 2 +-
drivers/block/sx8.c | 2 +-
drivers/block/ub.c | 2 +-
drivers/block/viodasd.c | 2 +-
drivers/cdrom/cdu31a.c | 2 +-
drivers/ide/ide-cd.c | 4 ++--
drivers/ide/ide-io.c | 6 +++---
drivers/message/i2o/i2o_block.c | 2 +-
drivers/mmc/mmc_block.c | 4 ++--
drivers/s390/block/dasd.c | 2 +-
drivers/s390/char/tape_block.c | 2 +-
drivers/scsi/ide-scsi.c | 4 ++--
drivers/scsi/scsi_lib.c | 2 +-
drivers/scsi/sd.c | 2 +-
include/linux/blkdev.h | 6 +++---
21 files changed, 42 insertions(+), 34 deletions(-)
Index: work/drivers/block/DAC960.c
===================================================================
--- work.orig/drivers/block/DAC960.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/block/DAC960.c 2005-11-18 00:35:04.000000000 +0900
@@ -3471,7 +3471,7 @@ static inline boolean DAC960_ProcessComp
if (!end_that_request_first(Request, UpToDate, Command->BlockCount)) {
- end_that_request_last(Request);
+ end_that_request_last(Request, UpToDate);
if (Command->Completion) {
complete(Command->Completion);
Index: work/drivers/block/cciss.c
===================================================================
--- work.orig/drivers/block/cciss.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/block/cciss.c 2005-11-18 00:35:04.000000000 +0900
@@ -2299,7 +2299,7 @@ static inline void complete_command( ctl
printk("Done with %p\n", cmd->rq);
#endif /* CCISS_DEBUG */
- end_that_request_last(cmd->rq);
+ end_that_request_last(cmd->rq, status ? 1 : -EIO);
cmd_free(h,cmd,1);
}
Index: work/drivers/block/cpqarray.c
===================================================================
--- work.orig/drivers/block/cpqarray.c 2005-11-18 00:06:46.000000000 +0900
+++ work/drivers/block/cpqarray.c 2005-11-18 00:35:04.000000000 +0900
@@ -1036,7 +1036,7 @@ static inline void complete_command(cmdl
complete_buffers(cmd->rq->bio, ok);
DBGPX(printk("Done with %p\n", cmd->rq););
- end_that_request_last(cmd->rq);
+ end_that_request_last(cmd->rq, ok ? 1 : -EIO);
}
/*
Index: work/drivers/block/floppy.c
===================================================================
--- work.orig/drivers/block/floppy.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/block/floppy.c 2005-11-18 00:35:04.000000000 +0900
@@ -2301,7 +2301,7 @@ static void floppy_end_request(struct re
add_disk_randomness(req->rq_disk);
floppy_off((long)req->rq_disk->private_data);
blkdev_dequeue_request(req);
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
/* We're done with the request */
current_req = NULL;
Index: work/drivers/block/nbd.c
===================================================================
--- work.orig/drivers/block/nbd.c 2005-11-18 00:06:46.000000000 +0900
+++ work/drivers/block/nbd.c 2005-11-18 00:35:04.000000000 +0900
@@ -136,7 +136,7 @@ static void nbd_end_request(struct reque
spin_lock_irqsave(q->queue_lock, flags);
if (!end_that_request_first(req, uptodate, req->nr_sectors)) {
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
}
spin_unlock_irqrestore(q->queue_lock, flags);
}
Index: work/drivers/block/sx8.c
===================================================================
--- work.orig/drivers/block/sx8.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/block/sx8.c 2005-11-18 00:35:04.000000000 +0900
@@ -770,7 +770,7 @@ static inline void carm_end_request_queu
rc = end_that_request_first(req, uptodate, req->hard_nr_sectors);
assert(rc == 0);
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
rc = carm_put_request(host, crq);
assert(rc == 0);
Index: work/drivers/block/ub.c
===================================================================
--- work.orig/drivers/block/ub.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/block/ub.c 2005-11-18 00:35:04.000000000 +0900
@@ -942,7 +942,7 @@ static void ub_end_rq(struct request *rq
rc = end_that_request_first(rq, uptodate, rq->hard_nr_sectors);
// assert(rc == 0);
- end_that_request_last(rq);
+ end_that_request_last(rq, uptodate);
}
/*
Index: work/drivers/block/viodasd.c
===================================================================
--- work.orig/drivers/block/viodasd.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/block/viodasd.c 2005-11-18 00:35:04.000000000 +0900
@@ -305,7 +305,7 @@ static void viodasd_end_request(struct r
if (end_that_request_first(req, uptodate, num_sectors))
return;
add_disk_randomness(req->rq_disk);
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
}
/*
Index: work/drivers/cdrom/cdu31a.c
===================================================================
--- work.orig/drivers/cdrom/cdu31a.c 2005-11-18 00:06:46.000000000 +0900
+++ work/drivers/cdrom/cdu31a.c 2005-11-18 00:35:04.000000000 +0900
@@ -1402,7 +1402,7 @@ static void do_cdu31a_request(request_qu
if (!end_that_request_first(req, 1, nblock)) {
spin_lock_irq(q->queue_lock);
blkdev_dequeue_request(req);
- end_that_request_last(req);
+ end_that_request_last(req, 1);
spin_unlock_irq(q->queue_lock);
}
continue;
Index: work/drivers/ide/ide-cd.c
===================================================================
--- work.orig/drivers/ide/ide-cd.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/ide/ide-cd.c 2005-11-18 00:35:04.000000000 +0900
@@ -614,7 +614,7 @@ static void cdrom_end_request (ide_drive
*/
spin_lock_irqsave(&ide_lock, flags);
end_that_request_chunk(failed, 0, failed->data_len);
- end_that_request_last(failed);
+ end_that_request_last(failed, 0);
spin_unlock_irqrestore(&ide_lock, flags);
}
@@ -1739,7 +1739,7 @@ end_request:
spin_lock_irqsave(&ide_lock, flags);
blkdev_dequeue_request(rq);
- end_that_request_last(rq);
+ end_that_request_last(rq, 1);
HWGROUP(drive)->rq = NULL;
spin_unlock_irqrestore(&ide_lock, flags);
return ide_stopped;
Index: work/drivers/ide/ide-io.c
===================================================================
--- work.orig/drivers/ide/ide-io.c 2005-11-18 00:06:46.000000000 +0900
+++ work/drivers/ide/ide-io.c 2005-11-18 00:35:04.000000000 +0900
@@ -89,7 +89,7 @@ int __ide_end_request(ide_drive_t *drive
blkdev_dequeue_request(rq);
HWGROUP(drive)->rq = NULL;
- end_that_request_last(rq);
+ end_that_request_last(rq, uptodate);
ret = 0;
}
return ret;
@@ -247,7 +247,7 @@ static void ide_complete_pm_request (ide
}
blkdev_dequeue_request(rq);
HWGROUP(drive)->rq = NULL;
- end_that_request_last(rq);
+ end_that_request_last(rq, 1);
spin_unlock_irqrestore(&ide_lock, flags);
}
@@ -379,7 +379,7 @@ void ide_end_drive_cmd (ide_drive_t *dri
blkdev_dequeue_request(rq);
HWGROUP(drive)->rq = NULL;
rq->errors = err;
- end_that_request_last(rq);
+ end_that_request_last(rq, !rq->errors);
spin_unlock_irqrestore(&ide_lock, flags);
}
Index: work/drivers/message/i2o/i2o_block.c
===================================================================
--- work.orig/drivers/message/i2o/i2o_block.c 2005-11-18 00:06:46.000000000 +0900
+++ work/drivers/message/i2o/i2o_block.c 2005-11-18 00:35:04.000000000 +0900
@@ -466,7 +466,7 @@ static void i2o_block_end_request(struct
spin_lock_irqsave(q->queue_lock, flags);
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
if (likely(dev)) {
dev->open_queue_depth--;
Index: work/drivers/mmc/mmc_block.c
===================================================================
--- work.orig/drivers/mmc/mmc_block.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/mmc/mmc_block.c 2005-11-18 00:35:04.000000000 +0900
@@ -263,7 +263,7 @@ static int mmc_blk_issue_rq(struct mmc_q
*/
add_disk_randomness(req->rq_disk);
blkdev_dequeue_request(req);
- end_that_request_last(req);
+ end_that_request_last(req, 1);
}
spin_unlock_irq(&md->lock);
} while (ret);
@@ -289,7 +289,7 @@ static int mmc_blk_issue_rq(struct mmc_q
add_disk_randomness(req->rq_disk);
blkdev_dequeue_request(req);
- end_that_request_last(req);
+ end_that_request_last(req, 0);
spin_unlock_irq(&md->lock);
return 0;
Index: work/drivers/s390/block/dasd.c
===================================================================
--- work.orig/drivers/s390/block/dasd.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/s390/block/dasd.c 2005-11-18 00:35:04.000000000 +0900
@@ -1035,7 +1035,7 @@ dasd_end_request(struct request *req, in
if (end_that_request_first(req, uptodate, req->hard_nr_sectors))
BUG();
add_disk_randomness(req->rq_disk);
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
}
/*
Index: work/drivers/s390/char/tape_block.c
===================================================================
--- work.orig/drivers/s390/char/tape_block.c 2005-11-18 00:06:46.000000000 +0900
+++ work/drivers/s390/char/tape_block.c 2005-11-18 00:35:04.000000000 +0900
@@ -78,7 +78,7 @@ tapeblock_end_request(struct request *re
{
if (end_that_request_first(req, uptodate, req->hard_nr_sectors))
BUG();
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
}
static void
Index: work/drivers/scsi/ide-scsi.c
===================================================================
--- work.orig/drivers/scsi/ide-scsi.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/scsi/ide-scsi.c 2005-11-18 00:35:04.000000000 +0900
@@ -1046,7 +1046,7 @@ static int idescsi_eh_reset (struct scsi
/* kill current request */
blkdev_dequeue_request(req);
- end_that_request_last(req);
+ end_that_request_last(req, 0);
if (req->flags & REQ_SENSE)
kfree(scsi->pc->buffer);
kfree(scsi->pc);
@@ -1056,7 +1056,7 @@ static int idescsi_eh_reset (struct scsi
/* now nuke the drive queue */
while ((req = elv_next_request(drive->queue))) {
blkdev_dequeue_request(req);
- end_that_request_last(req);
+ end_that_request_last(req, 0);
}
HWGROUP(drive)->rq = NULL;
Index: work/drivers/scsi/scsi_lib.c
===================================================================
--- work.orig/drivers/scsi/scsi_lib.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/scsi_lib.c 2005-11-18 00:35:04.000000000 +0900
@@ -624,7 +624,7 @@ static struct scsi_cmnd *scsi_end_reques
spin_lock_irqsave(q->queue_lock, flags);
if (blk_rq_tagged(req))
blk_queue_end_tag(q, req);
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
spin_unlock_irqrestore(q->queue_lock, flags);
/*
Index: work/drivers/scsi/sd.c
===================================================================
--- work.orig/drivers/scsi/sd.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sd.c 2005-11-18 00:35:04.000000000 +0900
@@ -762,7 +762,7 @@ static void sd_end_flush(request_queue_t
* force journal abort of barriers
*/
end_that_request_first(rq, -EOPNOTSUPP, rq->hard_nr_sectors);
- end_that_request_last(rq);
+ end_that_request_last(rq, -EOPNOTSUPP);
}
}
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h 2005-11-18 00:14:29.000000000 +0900
+++ work/include/linux/blkdev.h 2005-11-18 00:35:04.000000000 +0900
@@ -102,7 +102,7 @@ void copy_io_context(struct io_context *
void swap_io_context(struct io_context **ioc1, struct io_context **ioc2);
struct request;
-typedef void (rq_end_io_fn)(struct request *);
+typedef void (rq_end_io_fn)(struct request *, int);
struct request_list {
int count[2];
@@ -557,7 +557,7 @@ extern void blk_unregister_queue(struct
extern void register_disk(struct gendisk *dev);
extern void generic_make_request(struct bio *bio);
extern void blk_put_request(struct request *);
-extern void blk_end_sync_rq(struct request *rq);
+extern void blk_end_sync_rq(struct request *rq, int error);
extern void blk_attempt_remerge(request_queue_t *, struct request *);
extern struct request *blk_get_request(request_queue_t *, int, gfp_t);
extern void blk_insert_request(request_queue_t *, struct request *, int, void *);
@@ -607,7 +607,7 @@ static inline void blk_run_address_space
*/
extern int end_that_request_first(struct request *, int, int);
extern int end_that_request_chunk(struct request *, int, int);
-extern void end_that_request_last(struct request *);
+extern void end_that_request_last(struct request *, int);
extern void end_request(struct request *req, int uptodate);
/*
Index: work/block/elevator.c
===================================================================
--- work.orig/block/elevator.c 2005-11-18 00:14:21.000000000 +0900
+++ work/block/elevator.c 2005-11-18 00:35:04.000000000 +0900
@@ -484,7 +484,7 @@ struct request *elv_next_request(request
blkdev_dequeue_request(rq);
rq->flags |= REQ_QUIET;
end_that_request_chunk(rq, 0, nr_bytes);
- end_that_request_last(rq);
+ end_that_request_last(rq, 0);
} else {
printk(KERN_ERR "%s: bad return=%d\n", __FUNCTION__,
ret);
Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c 2005-11-18 00:14:21.000000000 +0900
+++ work/block/ll_rw_blk.c 2005-11-18 00:35:04.000000000 +0900
@@ -346,7 +346,7 @@ EXPORT_SYMBOL(blk_queue_issue_flush_fn);
/*
* Cache flushing for ordered writes handling
*/
-static void blk_pre_flush_end_io(struct request *flush_rq)
+static void blk_pre_flush_end_io(struct request *flush_rq, int error)
{
struct request *rq = flush_rq->end_io_data;
request_queue_t *q = rq->q;
@@ -364,7 +364,7 @@ static void blk_pre_flush_end_io(struct
}
}
-static void blk_post_flush_end_io(struct request *flush_rq)
+static void blk_post_flush_end_io(struct request *flush_rq, int error)
{
struct request *rq = flush_rq->end_io_data;
request_queue_t *q = rq->q;
@@ -2301,7 +2301,7 @@ EXPORT_SYMBOL(blk_rq_map_kern);
*/
void blk_execute_rq_nowait(request_queue_t *q, struct gendisk *bd_disk,
struct request *rq, int at_head,
- void (*done)(struct request *))
+ rq_end_io_fn *done)
{
int where = at_head ? ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK;
@@ -2501,7 +2501,7 @@ EXPORT_SYMBOL(blk_put_request);
* blk_end_sync_rq - executes a completion event on a request
* @rq: request to complete
*/
-void blk_end_sync_rq(struct request *rq)
+void blk_end_sync_rq(struct request *rq, int error)
{
struct completion *waiting = rq->waiting;
@@ -3163,9 +3163,17 @@ EXPORT_SYMBOL(end_that_request_chunk);
/*
* queue lock must be held
*/
-void end_that_request_last(struct request *req)
+void end_that_request_last(struct request *req, int uptodate)
{
struct gendisk *disk = req->rq_disk;
+ int error;
+
+ /*
+ * extend uptodate bool to allow < 0 value to be direct io error
+ */
+ error = 0;
+ if (end_io_error(uptodate))
+ error = !uptodate ? -EIO : uptodate;
if (unlikely(laptop_mode) && blk_fs_request(req))
laptop_io_completion();
@@ -3180,7 +3188,7 @@ void end_that_request_last(struct reques
disk->in_flight--;
}
if (req->end_io)
- req->end_io(req);
+ req->end_io(req, error);
else
__blk_put_request(req->q, req);
}
@@ -3192,7 +3200,7 @@ void end_request(struct request *req, in
if (!end_that_request_first(req, uptodate, req->hard_cur_sectors)) {
add_disk_randomness(req->rq_disk);
blkdev_dequeue_request(req);
- end_that_request_last(req);
+ end_that_request_last(req, uptodate);
}
}
07_blk_libata-add-fua-support.patch
Add FUA support to libata.
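A note for reviewers (illustration only, not part of the patch):
with the FUA slots added to ata_rw_cmds[], ata_rwcmd_protocol()
now picks the command as ata_rw_cmds[index + fua + lba48 + write].
A DMA FUA write on an LBA48-capable drive, for example, resolves as:

	index = 16;	/* dma section of ata_rw_cmds[] */
	fua   = 4;	/* ATA_TFLAG_FUA set */
	lba48 = 2;	/* ATA_TFLAG_LBA48 set */
	write = 1;	/* ATA_TFLAG_WRITE set */
	/* ata_rw_cmds[16 + 4 + 2 + 1] == ATA_CMD_WRITE_FUA_EXT */

Combinations without a FUA command (e.g. single-sector PIO) hit a
zero slot, ata_rwcmd_protocol() returns -1, and the translation
fails with an invalid-field response.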
Signed-off-by: Tejun Heo <[email protected]>
drivers/scsi/libata-core.c | 31 +++++++++++++++++++++++++------
drivers/scsi/libata-scsi.c | 32 ++++++++++++++++++++++++++------
drivers/scsi/libata.h | 4 +++-
include/linux/ata.h | 6 +++++-
include/linux/libata.h | 3 ++-
5 files changed, 61 insertions(+), 15 deletions(-)
Index: work/drivers/scsi/libata-core.c
===================================================================
--- work.orig/drivers/scsi/libata-core.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/scsi/libata-core.c 2005-11-18 00:35:06.000000000 +0900
@@ -563,16 +563,28 @@ static const u8 ata_rw_cmds[] = {
ATA_CMD_WRITE_MULTI,
ATA_CMD_READ_MULTI_EXT,
ATA_CMD_WRITE_MULTI_EXT,
+ 0,
+ 0,
+ 0,
+ ATA_CMD_WRITE_MULTI_FUA_EXT,
/* pio */
ATA_CMD_PIO_READ,
ATA_CMD_PIO_WRITE,
ATA_CMD_PIO_READ_EXT,
ATA_CMD_PIO_WRITE_EXT,
+ 0,
+ 0,
+ 0,
+ 0,
/* dma */
ATA_CMD_READ,
ATA_CMD_WRITE,
ATA_CMD_READ_EXT,
- ATA_CMD_WRITE_EXT
+ ATA_CMD_WRITE_EXT,
+ 0,
+ 0,
+ 0,
+ ATA_CMD_WRITE_FUA_EXT
};
/**
@@ -585,25 +597,32 @@ static const u8 ata_rw_cmds[] = {
* LOCKING:
* caller.
*/
-void ata_rwcmd_protocol(struct ata_queued_cmd *qc)
+int ata_rwcmd_protocol(struct ata_queued_cmd *qc)
{
struct ata_taskfile *tf = &qc->tf;
struct ata_device *dev = qc->dev;
+ u8 cmd;
- int index, lba48, write;
+ int index, fua, lba48, write;
+ fua = (tf->flags & ATA_TFLAG_FUA) ? 4 : 0;
lba48 = (tf->flags & ATA_TFLAG_LBA48) ? 2 : 0;
write = (tf->flags & ATA_TFLAG_WRITE) ? 1 : 0;
if (dev->flags & ATA_DFLAG_PIO) {
tf->protocol = ATA_PROT_PIO;
- index = dev->multi_count ? 0 : 4;
+ index = dev->multi_count ? 0 : 8;
} else {
tf->protocol = ATA_PROT_DMA;
- index = 8;
+ index = 16;
}
- tf->command = ata_rw_cmds[index + lba48 + write];
+ cmd = ata_rw_cmds[index + fua + lba48 + write];
+ if (cmd) {
+ tf->command = cmd;
+ return 0;
+ }
+ return -1;
}
static const char * xfer_mode_str[] = {
Index: work/include/linux/ata.h
===================================================================
--- work.orig/include/linux/ata.h 2005-11-18 00:14:29.000000000 +0900
+++ work/include/linux/ata.h 2005-11-18 00:35:06.000000000 +0900
@@ -129,6 +129,7 @@ enum {
ATA_CMD_READ_EXT = 0x25,
ATA_CMD_WRITE = 0xCA,
ATA_CMD_WRITE_EXT = 0x35,
+ ATA_CMD_WRITE_FUA_EXT = 0x3D,
ATA_CMD_PIO_READ = 0x20,
ATA_CMD_PIO_READ_EXT = 0x24,
ATA_CMD_PIO_WRITE = 0x30,
@@ -137,6 +138,7 @@ enum {
ATA_CMD_READ_MULTI_EXT = 0x29,
ATA_CMD_WRITE_MULTI = 0xC5,
ATA_CMD_WRITE_MULTI_EXT = 0x39,
+ ATA_CMD_WRITE_MULTI_FUA_EXT = 0xCE,
ATA_CMD_SET_FEATURES = 0xEF,
ATA_CMD_PACKET = 0xA0,
ATA_CMD_VERIFY = 0x40,
@@ -192,6 +194,7 @@ enum {
ATA_TFLAG_DEVICE = (1 << 2), /* enable r/w to device reg */
ATA_TFLAG_WRITE = (1 << 3), /* data dir: host->dev==1 (write) */
ATA_TFLAG_LBA = (1 << 4), /* enable LBA */
+ ATA_TFLAG_FUA = (1 << 5), /* enable FUA */
};
enum ata_tf_protocols {
@@ -245,7 +248,8 @@ struct ata_taskfile {
#define ata_id_is_sata(id) ((id)[93] == 0)
#define ata_id_rahead_enabled(id) ((id)[85] & (1 << 6))
#define ata_id_wcache_enabled(id) ((id)[85] & (1 << 5))
-#define ata_id_has_flush(id) ((id)[83] & (1 << 12))
+#define ata_id_has_fua(id) ((id)[84] & (1 << 6))
+#define ata_id_has_flush(id) ((id)[83] & (1 << 12))
#define ata_id_has_flush_ext(id) ((id)[83] & (1 << 13))
#define ata_id_has_lba48(id) ((id)[83] & (1 << 10))
#define ata_id_has_wcache(id) ((id)[82] & (1 << 5))
Index: work/drivers/scsi/libata-scsi.c
===================================================================
--- work.orig/drivers/scsi/libata-scsi.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/scsi/libata-scsi.c 2005-11-18 00:35:06.000000000 +0900
@@ -1080,11 +1080,13 @@ static unsigned int ata_scsi_rw_xlat(str
scsicmd[0] == WRITE_16)
tf->flags |= ATA_TFLAG_WRITE;
- /* Calculate the SCSI LBA and transfer length. */
+ /* Calculate the SCSI LBA, transfer length and FUA. */
switch (scsicmd[0]) {
case READ_10:
case WRITE_10:
scsi_10_lba_len(scsicmd, &block, &n_block);
+ if (unlikely(scsicmd[1] & (1 << 3)))
+ tf->flags |= ATA_TFLAG_FUA;
break;
case READ_6:
case WRITE_6:
@@ -1099,6 +1101,8 @@ static unsigned int ata_scsi_rw_xlat(str
case READ_16:
case WRITE_16:
scsi_16_lba_len(scsicmd, &block, &n_block);
+ if (unlikely(scsicmd[1] & (1 << 3)))
+ tf->flags |= ATA_TFLAG_FUA;
break;
default:
DPRINTK("no-byte command\n");
@@ -1142,7 +1146,8 @@ static unsigned int ata_scsi_rw_xlat(str
tf->device |= (block >> 24) & 0xf;
}
- ata_rwcmd_protocol(qc);
+ if (unlikely(ata_rwcmd_protocol(qc) < 0))
+ goto invalid_fld;
qc->nsect = n_block;
tf->nsect = n_block & 0xff;
@@ -1160,7 +1165,8 @@ static unsigned int ata_scsi_rw_xlat(str
if ((block >> 28) || (n_block > 256))
goto out_of_range;
- ata_rwcmd_protocol(qc);
+ if (unlikely(ata_rwcmd_protocol(qc) < 0))
+ goto invalid_fld;
/* Convert LBA to CHS */
track = (u32)block / dev->sectors;
@@ -1696,6 +1702,7 @@ static unsigned int ata_msense_rw_recove
unsigned int ata_scsiop_mode_sense(struct ata_scsi_args *args, u8 *rbuf,
unsigned int buflen)
{
+ struct ata_device *dev = args->dev;
u8 *scsicmd = args->cmd->cmnd, *p, *last;
const u8 sat_blk_desc[] = {
0, 0, 0, 0, /* number of blocks: sat unspecified */
@@ -1704,6 +1711,7 @@ unsigned int ata_scsiop_mode_sense(struc
};
u8 pg, spg;
unsigned int ebd, page_control, six_byte, output_len, alloc_len, minlen;
+ u8 dpofua;
VPRINTK("ENTER\n");
@@ -1772,9 +1780,17 @@ unsigned int ata_scsiop_mode_sense(struc
if (minlen < 1)
return 0;
+
+ dpofua = 0;
+ if (ata_id_has_fua(args->id) && dev->flags & ATA_DFLAG_LBA48 &&
+ (!(dev->flags & ATA_DFLAG_PIO) || dev->multi_count))
+ dpofua = 1 << 4;
+
if (six_byte) {
output_len--;
rbuf[0] = output_len;
+ if (minlen > 2)
+ rbuf[2] |= dpofua;
if (ebd) {
if (minlen > 3)
rbuf[3] = sizeof(sat_blk_desc);
@@ -1787,6 +1803,8 @@ unsigned int ata_scsiop_mode_sense(struc
rbuf[0] = output_len >> 8;
if (minlen > 1)
rbuf[1] = output_len;
+ if (minlen > 3)
+ rbuf[3] |= dpofua;
if (ebd) {
if (minlen > 7)
rbuf[7] = sizeof(sat_blk_desc);
@@ -2424,7 +2442,7 @@ int ata_scsi_queuecmd(struct scsi_cmnd *
if (xlat_func)
ata_scsi_translate(ap, dev, cmd, done, xlat_func);
else
- ata_scsi_simulate(dev->id, cmd, done);
+ ata_scsi_simulate(ap, dev, cmd, done);
} else
ata_scsi_translate(ap, dev, cmd, done, atapi_xlat);
@@ -2447,14 +2465,16 @@ out_unlock:
* spin_lock_irqsave(host_set lock)
*/
-void ata_scsi_simulate(u16 *id,
+void ata_scsi_simulate(struct ata_port *ap, struct ata_device *dev,
struct scsi_cmnd *cmd,
void (*done)(struct scsi_cmnd *))
{
struct ata_scsi_args args;
const u8 *scsicmd = cmd->cmnd;
- args.id = id;
+ args.ap = ap;
+ args.dev = dev;
+ args.id = dev->id;
args.cmd = cmd;
args.done = done;
Index: work/drivers/scsi/libata.h
===================================================================
--- work.orig/drivers/scsi/libata.h 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/scsi/libata.h 2005-11-18 00:35:06.000000000 +0900
@@ -32,6 +32,8 @@
#define DRV_VERSION "1.12" /* must be exactly four chars */
struct ata_scsi_args {
+ struct ata_port *ap;
+ struct ata_device *dev;
u16 *id;
struct scsi_cmnd *cmd;
void (*done)(struct scsi_cmnd *);
@@ -42,7 +44,7 @@ extern int atapi_enabled;
extern int ata_qc_complete_noop(struct ata_queued_cmd *qc, unsigned int err_mask);
extern struct ata_queued_cmd *ata_qc_new_init(struct ata_port *ap,
struct ata_device *dev);
-extern void ata_rwcmd_protocol(struct ata_queued_cmd *qc);
+extern int ata_rwcmd_protocol(struct ata_queued_cmd *qc);
extern void ata_qc_free(struct ata_queued_cmd *qc);
extern int ata_qc_issue(struct ata_queued_cmd *qc);
extern int ata_check_atapi_dma(struct ata_queued_cmd *qc);
Index: work/include/linux/libata.h
===================================================================
--- work.orig/include/linux/libata.h 2005-11-18 00:14:29.000000000 +0900
+++ work/include/linux/libata.h 2005-11-18 00:35:06.000000000 +0900
@@ -476,7 +476,8 @@ extern u8 ata_bmdma_status(struct ata_
extern void ata_bmdma_irq_clear(struct ata_port *ap);
extern void ata_qc_complete(struct ata_queued_cmd *qc, unsigned int err_mask);
extern void ata_eng_timeout(struct ata_port *ap);
-extern void ata_scsi_simulate(u16 *id, struct scsi_cmnd *cmd,
+extern void ata_scsi_simulate(struct ata_port *ap, struct ata_device *dev,
+ struct scsi_cmnd *cmd,
void (*done)(struct scsi_cmnd *));
extern int ata_std_bios_param(struct scsi_device *sdev,
struct block_device *bdev,
05_blk_scsi-add-fua-support.patch
Add FUA support to SCSI disk.
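For reviewers (illustration only, excerpting the logic below): the
FUA flag of a request is translated into the FUA bit (bit 3 of
byte 1) of the WRITE(10)/WRITE(16) CDB, and support is detected
from the DPOFUA bit of the mode-sense device-specific parameter:

	/* sd_init_command(), 10- and 16-byte paths */
	SCpnt->cmnd[1] |= blk_fua_rq(rq) ? 0x8 : 0;

	/* sd_read_cache_type(): DPOFUA is bit 4 of the device-specific byte */
	sdkp->DPOFUA = (data.device_specific & 0x10) != 0;

A FUA request on a drive restricted to 6-byte READ/WRITE commands
cannot be expressed and is failed in sd_init_command().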
Signed-off-by: Tejun Heo <[email protected]>
sd.c | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
Index: work/drivers/scsi/sd.c
===================================================================
--- work.orig/drivers/scsi/sd.c 2005-11-18 00:35:05.000000000 +0900
+++ work/drivers/scsi/sd.c 2005-11-18 00:35:05.000000000 +0900
@@ -102,6 +102,7 @@ struct scsi_disk {
u8 write_prot;
unsigned WCE : 1; /* state of disk WCE bit */
unsigned RCD : 1; /* state of disk RCD bit, unused */
+ unsigned DPOFUA : 1; /* state of disk DPOFUA bit */
};
static DEFINE_IDR(sd_index_idr);
@@ -357,6 +358,7 @@ static int sd_init_command(struct scsi_c
if (block > 0xffffffff) {
SCpnt->cmnd[0] += READ_16 - READ_6;
+ SCpnt->cmnd[1] |= blk_fua_rq(rq) ? 0x8 : 0;
SCpnt->cmnd[2] = sizeof(block) > 4 ? (unsigned char) (block >> 56) & 0xff : 0;
SCpnt->cmnd[3] = sizeof(block) > 4 ? (unsigned char) (block >> 48) & 0xff : 0;
SCpnt->cmnd[4] = sizeof(block) > 4 ? (unsigned char) (block >> 40) & 0xff : 0;
@@ -376,6 +378,7 @@ static int sd_init_command(struct scsi_c
this_count = 0xffff;
SCpnt->cmnd[0] += READ_10 - READ_6;
+ SCpnt->cmnd[1] |= blk_fua_rq(rq) ? 0x8 : 0;
SCpnt->cmnd[2] = (unsigned char) (block >> 24) & 0xff;
SCpnt->cmnd[3] = (unsigned char) (block >> 16) & 0xff;
SCpnt->cmnd[4] = (unsigned char) (block >> 8) & 0xff;
@@ -384,6 +387,17 @@ static int sd_init_command(struct scsi_c
SCpnt->cmnd[7] = (unsigned char) (this_count >> 8) & 0xff;
SCpnt->cmnd[8] = (unsigned char) this_count & 0xff;
} else {
+ if (unlikely(blk_fua_rq(rq))) {
+ /*
+ * This happens only if this drive failed
+ * 10byte rw command with ILLEGAL_REQUEST
+ * during operation and thus turned off
+ * use_10_for_rw.
+ */
+ printk(KERN_ERR "sd: FUA write on READ/WRITE(6) drive\n");
+ return 0;
+ }
+
SCpnt->cmnd[1] |= (unsigned char) ((block >> 16) & 0x1f);
SCpnt->cmnd[2] = (unsigned char) ((block >> 8) & 0xff);
SCpnt->cmnd[3] = (unsigned char) block & 0xff;
@@ -1409,10 +1423,18 @@ sd_read_cache_type(struct scsi_disk *sdk
sdkp->RCD = 0;
}
+ sdkp->DPOFUA = (data.device_specific & 0x10) != 0;
+ if (sdkp->DPOFUA && !sdkp->device->use_10_for_rw) {
+ printk(KERN_NOTICE "SCSI device %s: uses "
+ "READ/WRITE(6), disabling FUA\n", diskname);
+ sdkp->DPOFUA = 0;
+ }
+
ct = sdkp->RCD + 2*sdkp->WCE;
- printk(KERN_NOTICE "SCSI device %s: drive cache: %s\n",
- diskname, types[ct]);
+ printk(KERN_NOTICE "SCSI device %s: drive cache: %s%s\n",
+ diskname, types[ct],
+ sdkp->DPOFUA ? " w/ FUA" : "");
return;
}
@@ -1491,7 +1513,8 @@ static int sd_revalidate_disk(struct gen
* QUEUE_ORDERED_TAG_* even when ordered tag is supported.
*/
if (sdkp->WCE)
- ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+ ordered = sdkp->DPOFUA
+ ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
else
ordered = QUEUE_ORDERED_DRAIN;
02_blk_implement-init_request_from_bio.patch
Separate out bio initialization part from __make_request. It
will be used by the following blk_ordered_reimpl.
Signed-off-by: Tejun Heo <[email protected]>
ll_rw_blk.c | 62 +++++++++++++++++++++++++++++++-----------------------------
1 file changed, 33 insertions(+), 29 deletions(-)
Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c 2005-11-18 00:35:04.000000000 +0900
+++ work/block/ll_rw_blk.c 2005-11-18 00:35:05.000000000 +0900
@@ -38,6 +38,8 @@
static void blk_unplug_work(void *data);
static void blk_unplug_timeout(unsigned long data);
static void drive_stat_acct(struct request *rq, int nr_sectors, int new_io);
+static void init_request_from_bio(struct request *req, struct bio *bio);
+static int __make_request(request_queue_t *q, struct bio *bio);
/*
* For the allocated request tables
@@ -1651,8 +1653,6 @@ static int blk_init_free_list(request_qu
return 0;
}
-static int __make_request(request_queue_t *, struct bio *);
-
request_queue_t *blk_alloc_queue(gfp_t gfp_mask)
{
return blk_alloc_queue_node(gfp_mask, -1);
@@ -2639,6 +2639,36 @@ void blk_attempt_remerge(request_queue_t
EXPORT_SYMBOL(blk_attempt_remerge);
+static void init_request_from_bio(struct request *req, struct bio *bio)
+{
+ req->flags |= REQ_CMD;
+
+ /*
+ * inherit FAILFAST from bio (for read-ahead, and explicit FAILFAST)
+ */
+ if (bio_rw_ahead(bio) || bio_failfast(bio))
+ req->flags |= REQ_FAILFAST;
+
+ /*
+ * REQ_BARRIER implies no merging, but lets make it explicit
+ */
+ if (unlikely(bio_barrier(bio)))
+ req->flags |= (REQ_HARDBARRIER | REQ_NOMERGE);
+
+ req->errors = 0;
+ req->hard_sector = req->sector = bio->bi_sector;
+ req->hard_nr_sectors = req->nr_sectors = bio_sectors(bio);
+ req->current_nr_sectors = req->hard_cur_sectors = bio_cur_sectors(bio);
+ req->nr_phys_segments = bio_phys_segments(req->q, bio);
+ req->nr_hw_segments = bio_hw_segments(req->q, bio);
+ req->buffer = bio_data(bio); /* see ->buffer comment above */
+ req->waiting = NULL;
+ req->bio = req->biotail = bio;
+ req->ioprio = bio_prio(bio);
+ req->rq_disk = bio->bi_bdev->bd_disk;
+ req->start_time = jiffies;
+}
+
static int __make_request(request_queue_t *q, struct bio *bio)
{
struct request *req;
@@ -2734,33 +2764,7 @@ get_rq:
* We don't worry about that case for efficiency. It won't happen
* often, and the elevators are able to handle it.
*/
-
- req->flags |= REQ_CMD;
-
- /*
- * inherit FAILFAST from bio (for read-ahead, and explicit FAILFAST)
- */
- if (bio_rw_ahead(bio) || bio_failfast(bio))
- req->flags |= REQ_FAILFAST;
-
- /*
- * REQ_BARRIER implies no merging, but lets make it explicit
- */
- if (unlikely(barrier))
- req->flags |= (REQ_HARDBARRIER | REQ_NOMERGE);
-
- req->errors = 0;
- req->hard_sector = req->sector = sector;
- req->hard_nr_sectors = req->nr_sectors = nr_sectors;
- req->current_nr_sectors = req->hard_cur_sectors = cur_nr_sectors;
- req->nr_phys_segments = bio_phys_segments(q, bio);
- req->nr_hw_segments = bio_hw_segments(q, bio);
- req->buffer = bio_data(bio); /* see ->buffer comment above */
- req->waiting = NULL;
- req->bio = req->biotail = bio;
- req->ioprio = prio;
- req->rq_disk = bio->bi_bdev->bd_disk;
- req->start_time = jiffies;
+ init_request_from_bio(req, bio);
spin_lock_irq(q->queue_lock);
if (elv_queue_empty(q))
04_blk_scsi-update-ordered.patch
All ordered-request related stuff is delegated to the HLD. The
midlayer now doesn't deal with the ordered setting or the
prepare_flush callback. sd.c is updated to handle the
blk_queue_ordered setting. Currently, ordered tags aren't used
as the SCSI midlayer cannot guarantee request ordering.
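For reviewers, the resulting setup in sd_revalidate_disk() boils
down to (excerpted from the patch below):

	if (sdkp->WCE)
		ordered = QUEUE_ORDERED_DRAIN_FLUSH;
	else
		ordered = QUEUE_ORDERED_DRAIN;

	blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush,
			  GFP_KERNEL);

sd_remove() switches the queue back to QUEUE_ORDERED_NONE before
the gendisk is torn down.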
Signed-off-by: Tejun Heo <[email protected]>
drivers/scsi/hosts.c | 9 ------
drivers/scsi/scsi_lib.c | 46 ----------------------------------
drivers/scsi/sd.c | 60 ++++++++++++++++-----------------------------
include/scsi/scsi_driver.h | 1
include/scsi/scsi_host.h | 1
5 files changed, 22 insertions(+), 95 deletions(-)
Index: work/drivers/scsi/hosts.c
===================================================================
--- work.orig/drivers/scsi/hosts.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/scsi/hosts.c 2005-11-18 00:35:05.000000000 +0900
@@ -347,17 +347,8 @@ struct Scsi_Host *scsi_host_alloc(struct
shost->cmd_per_lun = sht->cmd_per_lun;
shost->unchecked_isa_dma = sht->unchecked_isa_dma;
shost->use_clustering = sht->use_clustering;
- shost->ordered_flush = sht->ordered_flush;
shost->ordered_tag = sht->ordered_tag;
- /*
- * hosts/devices that do queueing must support ordered tags
- */
- if (shost->can_queue > 1 && shost->ordered_flush) {
- printk(KERN_ERR "scsi: ordered flushes don't support queueing\n");
- shost->ordered_flush = 0;
- }
-
if (sht->max_host_blocked)
shost->max_host_blocked = sht->max_host_blocked;
else
Index: work/drivers/scsi/scsi_lib.c
===================================================================
--- work.orig/drivers/scsi/scsi_lib.c 2005-11-18 00:35:04.000000000 +0900
+++ work/drivers/scsi/scsi_lib.c 2005-11-18 00:35:05.000000000 +0900
@@ -765,9 +765,6 @@ void scsi_io_completion(struct scsi_cmnd
int sense_valid = 0;
int sense_deferred = 0;
- if (blk_complete_barrier_rq(q, req, good_bytes >> 9))
- return;
-
/*
* Free up any indirection buffers we allocated for DMA purposes.
* For the case of a READ, we need to copy the data out of the
@@ -1031,38 +1028,6 @@ static int scsi_init_io(struct scsi_cmnd
return BLKPREP_KILL;
}
-static int scsi_prepare_flush_fn(request_queue_t *q, struct request *rq)
-{
- struct scsi_device *sdev = q->queuedata;
- struct scsi_driver *drv;
-
- if (sdev->sdev_state == SDEV_RUNNING) {
- drv = *(struct scsi_driver **) rq->rq_disk->private_data;
-
- if (drv->prepare_flush)
- return drv->prepare_flush(q, rq);
- }
-
- return 0;
-}
-
-static void scsi_end_flush_fn(request_queue_t *q, struct request *rq)
-{
- struct scsi_device *sdev = q->queuedata;
- struct request *flush_rq = rq->end_io_data;
- struct scsi_driver *drv;
-
- if (flush_rq->errors) {
- printk("scsi: barrier error, disabling flush support\n");
- blk_queue_ordered(q, QUEUE_ORDERED_NONE);
- }
-
- if (sdev->sdev_state == SDEV_RUNNING) {
- drv = *(struct scsi_driver **) rq->rq_disk->private_data;
- drv->end_flush(q, rq);
- }
-}
-
static int scsi_issue_flush_fn(request_queue_t *q, struct gendisk *disk,
sector_t *error_sector)
{
@@ -1520,17 +1485,6 @@ struct request_queue *scsi_alloc_queue(s
blk_queue_segment_boundary(q, shost->dma_boundary);
blk_queue_issue_flush_fn(q, scsi_issue_flush_fn);
- /*
- * ordered tags are superior to flush ordering
- */
- if (shost->ordered_tag)
- blk_queue_ordered(q, QUEUE_ORDERED_TAG);
- else if (shost->ordered_flush) {
- blk_queue_ordered(q, QUEUE_ORDERED_FLUSH);
- q->prepare_flush_fn = scsi_prepare_flush_fn;
- q->end_flush_fn = scsi_end_flush_fn;
- }
-
if (!shost->use_clustering)
clear_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
return q;
Index: work/drivers/scsi/sd.c
===================================================================
--- work.orig/drivers/scsi/sd.c 2005-11-18 00:35:04.000000000 +0900
+++ work/drivers/scsi/sd.c 2005-11-18 00:35:05.000000000 +0900
@@ -121,8 +121,7 @@ static void sd_shutdown(struct device *d
static void sd_rescan(struct device *);
static int sd_init_command(struct scsi_cmnd *);
static int sd_issue_flush(struct device *, sector_t *);
-static void sd_end_flush(request_queue_t *, struct request *);
-static int sd_prepare_flush(request_queue_t *, struct request *);
+static void sd_prepare_flush(request_queue_t *, struct request *);
static void sd_read_capacity(struct scsi_disk *sdkp, char *diskname,
unsigned char *buffer);
@@ -137,8 +136,6 @@ static struct scsi_driver sd_template =
.rescan = sd_rescan,
.init_command = sd_init_command,
.issue_flush = sd_issue_flush,
- .prepare_flush = sd_prepare_flush,
- .end_flush = sd_end_flush,
};
/*
@@ -743,42 +740,13 @@ static int sd_issue_flush(struct device
return ret;
}
-static void sd_end_flush(request_queue_t *q, struct request *flush_rq)
+static void sd_prepare_flush(request_queue_t *q, struct request *rq)
{
- struct request *rq = flush_rq->end_io_data;
- struct scsi_cmnd *cmd = rq->special;
- unsigned int bytes = rq->hard_nr_sectors << 9;
-
- if (!flush_rq->errors) {
- spin_unlock(q->queue_lock);
- scsi_io_completion(cmd, bytes, 0);
- spin_lock(q->queue_lock);
- } else if (blk_barrier_postflush(rq)) {
- spin_unlock(q->queue_lock);
- scsi_io_completion(cmd, 0, bytes);
- spin_lock(q->queue_lock);
- } else {
- /*
- * force journal abort of barriers
- */
- end_that_request_first(rq, -EOPNOTSUPP, rq->hard_nr_sectors);
- end_that_request_last(rq, -EOPNOTSUPP);
- }
-}
-
-static int sd_prepare_flush(request_queue_t *q, struct request *rq)
-{
- struct scsi_device *sdev = q->queuedata;
- struct scsi_disk *sdkp = dev_get_drvdata(&sdev->sdev_gendev);
-
- if (!sdkp || !sdkp->WCE)
- return 0;
-
memset(rq->cmd, 0, sizeof(rq->cmd));
- rq->flags |= REQ_BLOCK_PC | REQ_SOFTBARRIER;
+ rq->flags |= REQ_BLOCK_PC;
rq->timeout = SD_TIMEOUT;
rq->cmd[0] = SYNCHRONIZE_CACHE;
- return 1;
+ rq->cmd_len = 10;
}
static void sd_rescan(struct device *dev)
@@ -1476,6 +1444,7 @@ static int sd_revalidate_disk(struct gen
struct scsi_disk *sdkp = scsi_disk(disk);
struct scsi_device *sdp = sdkp->device;
unsigned char *buffer;
+ unsigned ordered;
SCSI_LOG_HLQUEUE(3, printk("sd_revalidate_disk: disk=%s\n", disk->disk_name));
@@ -1514,7 +1483,21 @@ static int sd_revalidate_disk(struct gen
buffer);
sd_read_cache_type(sdkp, disk->disk_name, buffer);
}
-
+
+ /*
+ * We now have all cache related info, determine how we deal
+ * with ordered requests. Note that as the current SCSI
+ * dispatch function can alter request order, we cannot use
+ * QUEUE_ORDERED_TAG_* even when ordered tag is supported.
+ */
+ if (sdkp->WCE)
+ ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+ else
+ ordered = QUEUE_ORDERED_DRAIN;
+
+ blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush,
+ GFP_KERNEL);
+
set_capacity(disk, sdkp->capacity);
kfree(buffer);
@@ -1614,6 +1597,7 @@ static int sd_probe(struct device *dev)
strcpy(gd->devfs_name, sdp->devfs_name);
gd->private_data = &sdkp->driver;
+ gd->queue = sdkp->device->request_queue;
sd_revalidate_disk(gd);
@@ -1621,7 +1605,6 @@ static int sd_probe(struct device *dev)
gd->flags = GENHD_FL_DRIVERFS;
if (sdp->removable)
gd->flags |= GENHD_FL_REMOVABLE;
- gd->queue = sdkp->device->request_queue;
dev_set_drvdata(dev, sdkp);
add_disk(gd);
@@ -1654,6 +1637,7 @@ static int sd_remove(struct device *dev)
{
struct scsi_disk *sdkp = dev_get_drvdata(dev);
+ blk_queue_ordered(sdkp->disk->queue, QUEUE_ORDERED_NONE, NULL, 0);
del_gendisk(sdkp->disk);
sd_shutdown(dev);
Index: work/include/scsi/scsi_driver.h
===================================================================
--- work.orig/include/scsi/scsi_driver.h 2005-11-18 00:06:46.000000000 +0900
+++ work/include/scsi/scsi_driver.h 2005-11-18 00:35:05.000000000 +0900
@@ -15,7 +15,6 @@ struct scsi_driver {
void (*rescan)(struct device *);
int (*issue_flush)(struct device *, sector_t *);
int (*prepare_flush)(struct request_queue *, struct request *);
- void (*end_flush)(struct request_queue *, struct request *);
};
#define to_scsi_driver(drv) \
container_of((drv), struct scsi_driver, gendrv)
Index: work/include/scsi/scsi_host.h
===================================================================
--- work.orig/include/scsi/scsi_host.h 2005-11-18 00:14:29.000000000 +0900
+++ work/include/scsi/scsi_host.h 2005-11-18 00:35:05.000000000 +0900
@@ -392,7 +392,6 @@ struct scsi_host_template {
/*
* ordered write support
*/
- unsigned ordered_flush:1;
unsigned ordered_tag:1;
/*
06_blk_libata-update-ordered.patch
Reflect changes in the SCSI midlayer and update libata to use the
new ordered request implementation.
Signed-off-by: Tejun Heo <[email protected]>
ahci.c | 1 -
ata_piix.c | 1 -
sata_mv.c | 1 -
sata_nv.c | 1 -
sata_promise.c | 1 -
sata_sil.c | 1 -
sata_sil24.c | 1 -
sata_sis.c | 1 -
sata_svw.c | 1 -
sata_sx4.c | 1 -
sata_uli.c | 1 -
sata_via.c | 1 -
sata_vsc.c | 1 -
13 files changed, 13 deletions(-)
Index: work/drivers/scsi/ahci.c
===================================================================
--- work.orig/drivers/scsi/ahci.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/scsi/ahci.c 2005-11-18 00:35:06.000000000 +0900
@@ -213,7 +213,6 @@ static struct scsi_host_template ahci_sh
.dma_boundary = AHCI_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations ahci_ops = {
Index: work/drivers/scsi/ata_piix.c
===================================================================
--- work.orig/drivers/scsi/ata_piix.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/scsi/ata_piix.c 2005-11-18 00:35:06.000000000 +0900
@@ -144,7 +144,6 @@ static struct scsi_host_template piix_sh
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations piix_pata_ops = {
Index: work/drivers/scsi/sata_mv.c
===================================================================
--- work.orig/drivers/scsi/sata_mv.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_mv.c 2005-11-18 00:35:06.000000000 +0900
@@ -287,7 +287,6 @@ static struct scsi_host_template mv_sht
.dma_boundary = MV_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations mv_ops = {
Index: work/drivers/scsi/sata_nv.c
===================================================================
--- work.orig/drivers/scsi/sata_nv.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_nv.c 2005-11-18 00:35:06.000000000 +0900
@@ -235,7 +235,6 @@ static struct scsi_host_template nv_sht
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations nv_ops = {
Index: work/drivers/scsi/sata_promise.c
===================================================================
--- work.orig/drivers/scsi/sata_promise.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_promise.c 2005-11-18 00:35:06.000000000 +0900
@@ -111,7 +111,6 @@ static struct scsi_host_template pdc_ata
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations pdc_sata_ops = {
Index: work/drivers/scsi/sata_sil.c
===================================================================
--- work.orig/drivers/scsi/sata_sil.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_sil.c 2005-11-18 00:35:06.000000000 +0900
@@ -147,7 +147,6 @@ static struct scsi_host_template sil_sht
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations sil_ops = {
Index: work/drivers/scsi/sata_sis.c
===================================================================
--- work.orig/drivers/scsi/sata_sis.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_sis.c 2005-11-18 00:35:06.000000000 +0900
@@ -99,7 +99,6 @@ static struct scsi_host_template sis_sht
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations sis_ops = {
Index: work/drivers/scsi/sata_svw.c
===================================================================
--- work.orig/drivers/scsi/sata_svw.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_svw.c 2005-11-18 00:35:06.000000000 +0900
@@ -303,7 +303,6 @@ static struct scsi_host_template k2_sata
.proc_info = k2_sata_proc_info,
#endif
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
Index: work/drivers/scsi/sata_sx4.c
===================================================================
--- work.orig/drivers/scsi/sata_sx4.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_sx4.c 2005-11-18 00:35:06.000000000 +0900
@@ -194,7 +194,6 @@ static struct scsi_host_template pdc_sat
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations pdc_20621_ops = {
Index: work/drivers/scsi/sata_uli.c
===================================================================
--- work.orig/drivers/scsi/sata_uli.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_uli.c 2005-11-18 00:35:06.000000000 +0900
@@ -87,7 +87,6 @@ static struct scsi_host_template uli_sht
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations uli_ops = {
Index: work/drivers/scsi/sata_via.c
===================================================================
--- work.orig/drivers/scsi/sata_via.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_via.c 2005-11-18 00:35:06.000000000 +0900
@@ -106,7 +106,6 @@ static struct scsi_host_template svia_sh
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
static const struct ata_port_operations svia_sata_ops = {
Index: work/drivers/scsi/sata_vsc.c
===================================================================
--- work.orig/drivers/scsi/sata_vsc.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_vsc.c 2005-11-18 00:35:06.000000000 +0900
@@ -235,7 +235,6 @@ static struct scsi_host_template vsc_sat
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1,
};
Index: work/drivers/scsi/sata_sil24.c
===================================================================
--- work.orig/drivers/scsi/sata_sil24.c 2005-11-18 00:14:22.000000000 +0900
+++ work/drivers/scsi/sata_sil24.c 2005-11-18 00:35:06.000000000 +0900
@@ -272,7 +272,6 @@ static struct scsi_host_template sil24_s
.dma_boundary = ATA_DMA_BOUNDARY,
.slave_configure = ata_scsi_slave_config,
.bios_param = ata_std_bios_param,
- .ordered_flush = 1, /* NCQ not supported yet */
};
static const struct ata_port_operations sil24_ops = {
09_blk_ide-add-fua-support.patch
Add FUA support to IDE.
Signed-off-by: Tejun Heo <[email protected]>
drivers/ide/ide-disk.c | 57 +++++++++++++++++++++++++++++++++++++++++--------
include/linux/hdreg.h | 16 ++++++++++++-
include/linux/ide.h | 3 ++
3 files changed, 65 insertions(+), 11 deletions(-)
Index: work/drivers/ide/ide-disk.c
===================================================================
--- work.orig/drivers/ide/ide-disk.c 2005-11-18 00:35:06.000000000 +0900
+++ work/drivers/ide/ide-disk.c 2005-11-18 00:35:07.000000000 +0900
@@ -164,13 +164,14 @@ static ide_startstop_t __ide_do_rw_disk(
ide_hwif_t *hwif = HWIF(drive);
unsigned int dma = drive->using_dma;
u8 lba48 = (drive->addressing == 1) ? 1 : 0;
+ int fua = blk_fua_rq(rq);
task_ioreg_t command = WIN_NOP;
ata_nsector_t nsectors;
nsectors.all = (u16) rq->nr_sectors;
if (hwif->no_lba48_dma && lba48 && dma) {
- if (block + rq->nr_sectors > 1ULL << 28)
+ if (block + rq->nr_sectors > 1ULL << 28 || fua)
dma = 0;
else
lba48 = 0;
@@ -226,6 +227,16 @@ static ide_startstop_t __ide_do_rw_disk(
hwif->OUTB(tasklets[6], IDE_HCYL_REG);
hwif->OUTB(0x00|drive->select.all,IDE_SELECT_REG);
} else {
+ if (unlikely(fua)) {
+ /*
+ * This happens if LBA48 addressing is
+ * turned off during operation.
+ */
+ printk(KERN_ERR "%s: FUA write but LBA48 off\n",
+ drive->name);
+ goto fail;
+ }
+
hwif->OUTB(0x00, IDE_FEATURE_REG);
hwif->OUTB(nsectors.b.low, IDE_NSECTOR_REG);
hwif->OUTB(block, IDE_SECTOR_REG);
@@ -253,9 +264,12 @@ static ide_startstop_t __ide_do_rw_disk(
if (dma) {
if (!hwif->dma_setup(drive)) {
if (rq_data_dir(rq)) {
- command = lba48 ? WIN_WRITEDMA_EXT : WIN_WRITEDMA;
- if (drive->vdma)
- command = lba48 ? WIN_WRITE_EXT: WIN_WRITE;
+ if (!fua) {
+ command = lba48 ? WIN_WRITEDMA_EXT : WIN_WRITEDMA;
+ if (drive->vdma)
+ command = lba48 ? WIN_WRITE_EXT: WIN_WRITE;
+ } else
+ command = ATA_CMD_WRITE_FUA_EXT;
} else {
command = lba48 ? WIN_READDMA_EXT : WIN_READDMA;
if (drive->vdma)
@@ -284,8 +298,20 @@ static ide_startstop_t __ide_do_rw_disk(
} else {
if (drive->mult_count) {
hwif->data_phase = TASKFILE_MULTI_OUT;
- command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
+ if (!fua)
+ command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
+ else
+ command = ATA_CMD_WRITE_MULTI_FUA_EXT;
} else {
+ if (unlikely(fua)) {
+ /*
+ * This happens if multisector PIO is
+ * turned off during operation.
+ */
+ printk(KERN_ERR "%s: FUA write but in single "
+ "sector PIO mode\n", drive->name);
+ goto fail;
+ }
hwif->data_phase = TASKFILE_OUT;
command = lba48 ? WIN_WRITE_EXT : WIN_WRITE;
}
@@ -295,6 +321,10 @@ static ide_startstop_t __ide_do_rw_disk(
return pre_task_out_intr(drive, rq);
}
+
+ fail:
+ ide_end_request(drive, 0, 0);
+ return ide_stopped;
}
/*
@@ -846,7 +876,7 @@ static void idedisk_setup (ide_drive_t *
{
struct hd_driveid *id = drive->id;
unsigned long long capacity;
- int barrier;
+ int barrier, fua;
idedisk_add_settings(drive);
@@ -967,10 +997,19 @@ static void idedisk_setup (ide_drive_t *
barrier = 0;
}
- printk(KERN_INFO "%s: cache flushes %ssupported\n",
- drive->name, barrier ? "" : "not ");
+ fua = barrier && idedisk_supports_lba48(id) && ide_id_has_fua(id);
+ /* When using PIO, FUA needs multisector. */
+ if ((!drive->using_dma || drive->hwif->no_lba48_dma) &&
+ drive->mult_count == 0)
+ fua = 0;
+
+ printk(KERN_INFO "%s: cache flushes %ssupported%s\n",
+ drive->name, barrier ? "" : "not ",
+ fua ? " w/ FUA" : "");
if (barrier) {
- blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN_FLUSH,
+ unsigned ordered = fua ? QUEUE_ORDERED_DRAIN_FUA
+ : QUEUE_ORDERED_DRAIN_FLUSH;
+ blk_queue_ordered(drive->queue, ordered,
idedisk_prepare_flush, GFP_KERNEL);
blk_queue_issue_flush_fn(drive->queue, idedisk_issue_flush);
} else if (!drive->wcache)
Index: work/include/linux/hdreg.h
===================================================================
--- work.orig/include/linux/hdreg.h 2005-11-18 00:06:46.000000000 +0900
+++ work/include/linux/hdreg.h 2005-11-18 00:35:07.000000000 +0900
@@ -550,7 +550,13 @@ struct hd_driveid {
* cmd set-feature supported extensions
* 15: Shall be ZERO
* 14: Shall be ONE
- * 13:6 reserved
+ * 13: IDLE IMMEDIATE w/ UNLOAD FEATURE
+ * 12:11 reserved for technical report
+ * 10: URG for WRITE STREAM
+ * 9: URG for READ STREAM
+ * 8: 64-bit World wide name
+ * 7: WRITE DMA QUEUED FUA EXT
+ * 6: WRITE DMA/MULTIPLE FUA EXT
* 5: General Purpose Logging
* 4: Streaming Feature Set
* 3: Media Card Pass Through
@@ -600,7 +606,13 @@ struct hd_driveid {
* command set-feature default
* 15: Shall be ZERO
* 14: Shall be ONE
- * 13:6 reserved
+ * 13: IDLE IMMEDIATE w/ UNLOAD FEATURE
+ * 12:11 reserved for technical report
+ * 10: URG for WRITE STREAM
+ * 9: URG for READ STREAM
+ * 8: 64-bit World wide name
+ * 7: WRITE DMA QUEUED FUA EXT
+ * 6: WRITE DMA/MULTIPLE FUA EXT
* 5: General Purpose Logging enabled
* 4: Valid CONFIGURE STREAM executed
* 3: Media Card Pass Through enabled
Index: work/include/linux/ide.h
===================================================================
--- work.orig/include/linux/ide.h 2005-11-18 00:14:29.000000000 +0900
+++ work/include/linux/ide.h 2005-11-18 00:35:07.000000000 +0900
@@ -1503,6 +1503,9 @@ extern struct bus_type ide_bus_type;
/* check if CACHE FLUSH (EXT) command is supported (bits defined in ATA-6) */
#define ide_id_has_flush_cache(id) ((id)->cfs_enable_2 & 0x3000)
+/* check if WRITE DMA FUA EXT command is supported (defined in ATA-8) */
+#define ide_id_has_fua(id) ((id)->cfsse & 0x0040)
+
/* some Maxtor disks have bit 13 defined incorrectly so check bit 10 too */
#define ide_id_has_flush_cache_ext(id) \
(((id)->cfs_enable_2 & 0x2400) == 0x2400)
03_blk_reimplement-ordered.patch
Reimplement handling of barrier requests.
* Flexible handling to deal with various capabilities of
target devices.
* Retry support for falling back.
* Tagged queues which don't support ordered tags can still do
ordered writes (by draining the queue).
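For reviewers, a rough sketch of the driver-visible interface
after this patch (the "mydrv" names are made up): a high-level
driver declares the ordered mode matching its device and, if a
cache flush is needed, supplies a callback that turns a
block-layer request into the device's flush command.

	static void mydrv_prepare_flush(request_queue_t *q, struct request *rq)
	{
		/* fill @rq with a device-specific cache flush command */
	}

	/* write-back cache, no FUA: drain, flush before/after the barrier */
	blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
			  mydrv_prepare_flush, GFP_KERNEL);

	/* write-through cache: draining the queue is enough */
	blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL, GFP_KERNEL);

	/* give up ordered support, e.g. after a failed flush */
	blk_queue_ordered(q, QUEUE_ORDERED_NONE, NULL, 0);

The block layer builds the pre-flush/barrier/post-flush sequence
from preallocated requests and walks it through the QUEUE_ORDSEQ_*
stages, requeueing in sequence order when necessary.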
Signed-off-by: Tejun Heo <[email protected]>
block/elevator.c | 83 +++++---
block/ll_rw_blk.c | 468 ++++++++++++++++++++++++++++++++---------------
include/linux/blkdev.h | 83 +++++---
include/linux/elevator.h | 1
4 files changed, 443 insertions(+), 192 deletions(-)
Index: work/block/elevator.c
===================================================================
--- work.orig/block/elevator.c 2005-11-18 00:35:04.000000000 +0900
+++ work/block/elevator.c 2005-11-18 00:35:05.000000000 +0900
@@ -303,22 +303,24 @@ void elv_requeue_request(request_queue_t
rq->flags &= ~REQ_STARTED;
- /*
- * if this is the flush, requeue the original instead and drop the flush
- */
- if (rq->flags & REQ_BAR_FLUSH) {
- clear_bit(QUEUE_FLAG_FLUSH, &q->queue_flags);
- rq = rq->end_io_data;
- }
-
- __elv_add_request(q, rq, ELEVATOR_INSERT_FRONT, 0);
+ __elv_add_request(q, rq, ELEVATOR_INSERT_REQUEUE, 0);
}
void __elv_add_request(request_queue_t *q, struct request *rq, int where,
int plug)
{
+ struct list_head *pos;
+ unsigned ordseq;
+
+ rq->flags |= q->ordcolor ? REQ_ORDERED_COLOR : 0;
+
if (rq->flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
/*
+ * toggle ordered color
+ */
+ q->ordcolor ^= 1;
+
+ /*
* barriers implicitly indicate back insertion
*/
if (where == ELEVATOR_INSERT_SORT)
@@ -379,6 +381,30 @@ void __elv_add_request(request_queue_t *
q->elevator->ops->elevator_add_req_fn(q, rq);
break;
+ case ELEVATOR_INSERT_REQUEUE:
+ /*
+ * If ordered flush isn't in progress, we do front
+ * insertion; otherwise, requests should be requeued
+ * in ordseq order.
+ */
+ rq->flags |= REQ_SOFTBARRIER;
+
+ if (q->ordseq == 0) {
+ list_add(&rq->queuelist, &q->queue_head);
+ break;
+ }
+
+ ordseq = blk_ordered_req_seq(rq);
+
+ list_for_each(pos, &q->queue_head) {
+ struct request *pos_rq = list_entry_rq(pos);
+ if (ordseq <= blk_ordered_req_seq(pos_rq))
+ break;
+ }
+
+ list_add_tail(&rq->queuelist, pos);
+ break;
+
default:
printk(KERN_ERR "%s: bad insertion point %d\n",
__FUNCTION__, where);
@@ -408,25 +434,16 @@ static inline struct request *__elv_next
{
struct request *rq;
- if (unlikely(list_empty(&q->queue_head) &&
- !q->elevator->ops->elevator_dispatch_fn(q, 0)))
- return NULL;
-
- rq = list_entry_rq(q->queue_head.next);
-
- /*
- * if this is a barrier write and the device has to issue a
- * flush sequence to support it, check how far we are
- */
- if (blk_fs_request(rq) && blk_barrier_rq(rq)) {
- BUG_ON(q->ordered == QUEUE_ORDERED_NONE);
+ while (1) {
+ while (!list_empty(&q->queue_head)) {
+ rq = list_entry_rq(q->queue_head.next);
+ if (blk_do_ordered(q, &rq))
+ return rq;
+ }
- if (q->ordered == QUEUE_ORDERED_FLUSH &&
- !blk_barrier_preflush(rq))
- rq = blk_start_pre_flush(q, rq);
+ if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
+ return NULL;
}
-
- return rq;
}
struct request *elv_next_request(request_queue_t *q)
@@ -593,7 +610,21 @@ void elv_completed_request(request_queue
* request is released from the driver, io must be done
*/
if (blk_account_rq(rq)) {
+ struct request *first_rq = list_entry_rq(q->queue_head.next);
+
q->in_flight--;
+
+ /*
+ * Check if the queue is waiting for fs requests to be
+ * drained for flush sequence.
+ */
+ if (q->ordseq && q->in_flight == 0 &&
+ blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN &&
+ blk_ordered_req_seq(first_rq) > QUEUE_ORDSEQ_DRAIN) {
+ blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0);
+ q->request_fn(q);
+ }
+
if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
e->ops->elevator_completed_req_fn(q, rq);
}
Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c 2005-11-18 00:35:05.000000000 +0900
+++ work/block/ll_rw_blk.c 2005-11-18 00:35:05.000000000 +0900
@@ -290,10 +290,104 @@ static inline void rq_init(request_queue
rq->end_io_data = NULL;
}
+static int __blk_queue_ordered(request_queue_t *q, int ordered,
+ prepare_flush_fn *prepare_flush_fn,
+ unsigned gfp_mask, int locked)
+{
+ void *rq0 = NULL, *rq1 = NULL, *rq2 = NULL;
+ unsigned long flags = 0; /* shut up, gcc */
+ int ret = 0;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+
+ if (ordered & (QUEUE_ORDERED_PREFLUSH | QUEUE_ORDERED_POSTFLUSH) &&
+ prepare_flush_fn == NULL) {
+ printk(KERN_ERR "blk_queue_ordered: prepare_flush_fn required\n");
+ ret = -EINVAL;
+ goto out_unlocked;
+ }
+
+ if (!locked)
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ if (q->ordseq) {
+ /* Queue flushing in progress. */
+ ret = -EINPROGRESS;
+ goto out;
+ }
+
+ switch (ordered) {
+ case QUEUE_ORDERED_NONE:
+ if (!q->pre_flush_rq) {
+ rq0 = q->pre_flush_rq;
+ rq1 = q->bar_rq;
+ rq2 = q->post_flush_rq;
+ q->pre_flush_rq = NULL;
+ q->bar_rq = NULL;
+ q->post_flush_rq = NULL;
+ }
+ break;
+
+ case QUEUE_ORDERED_DRAIN:
+ case QUEUE_ORDERED_DRAIN_FLUSH:
+ case QUEUE_ORDERED_DRAIN_FUA:
+ case QUEUE_ORDERED_TAG:
+ case QUEUE_ORDERED_TAG_FLUSH:
+ case QUEUE_ORDERED_TAG_FUA:
+ if (q->pre_flush_rq)
+ break;
+
+ if (!locked)
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ if (!(rq0 = kmem_cache_alloc(request_cachep, gfp_mask)) ||
+ !(rq1 = kmem_cache_alloc(request_cachep, gfp_mask)) ||
+ !(rq2 = kmem_cache_alloc(request_cachep, gfp_mask))) {
+ printk(KERN_ERR "blk_queue_ordered: failed to "
+ "switch to 0x%x (out of memory)\n", ordered);
+ ret = -ENOMEM;
+ goto out_unlocked;
+ }
+ if (!locked)
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ if (!q->pre_flush_rq) {
+ q->pre_flush_rq = rq0;
+ q->bar_rq = rq1;
+ q->post_flush_rq = rq2;
+ rq0 = NULL;
+ rq1 = NULL;
+ rq2 = NULL;
+ }
+ break;
+
+ default:
+ printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ q->ordered = ordered;
+ q->prepare_flush_fn = prepare_flush_fn;
+
+ out:
+ if (!locked)
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ out_unlocked:
+ if (rq0)
+ kmem_cache_free(request_cachep, rq0);
+ if (rq1)
+ kmem_cache_free(request_cachep, rq1);
+ if (rq2)
+ kmem_cache_free(request_cachep, rq2);
+
+ return ret;
+}
+
/**
* blk_queue_ordered - does this queue support ordered writes
- * @q: the request queue
- * @flag: see below
+ * @q: the request queue
+ * @ordered: one of QUEUE_ORDERED_*
+ * @gfp_mask: allocation mask
*
* Description:
* For journalled file systems, doing ordered writes on a commit
@@ -302,32 +396,24 @@ static inline void rq_init(request_queue
* feature should call this function and indicate so.
*
**/
-void blk_queue_ordered(request_queue_t *q, int flag)
+extern int blk_queue_ordered(request_queue_t *q, unsigned ordered,
+ prepare_flush_fn *prepare_flush_fn,
+ unsigned gfp_mask)
{
- switch (flag) {
- case QUEUE_ORDERED_NONE:
- if (q->flush_rq)
- kmem_cache_free(request_cachep, q->flush_rq);
- q->flush_rq = NULL;
- q->ordered = flag;
- break;
- case QUEUE_ORDERED_TAG:
- q->ordered = flag;
- break;
- case QUEUE_ORDERED_FLUSH:
- q->ordered = flag;
- if (!q->flush_rq)
- q->flush_rq = kmem_cache_alloc(request_cachep,
- GFP_KERNEL);
- break;
- default:
- printk("blk_queue_ordered: bad value %d\n", flag);
- break;
- }
+ return __blk_queue_ordered(q, ordered, prepare_flush_fn, gfp_mask, 0);
}
EXPORT_SYMBOL(blk_queue_ordered);
+extern int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
+ prepare_flush_fn *prepare_flush_fn,
+ unsigned gfp_mask)
+{
+ return __blk_queue_ordered(q, ordered, prepare_flush_fn, gfp_mask, 1);
+}
+
+EXPORT_SYMBOL(blk_queue_ordered_locked);
+
/**
* blk_queue_issue_flush_fn - set function for issuing a flush
* @q: the request queue
@@ -348,169 +434,265 @@ EXPORT_SYMBOL(blk_queue_issue_flush_fn);
/*
* Cache flushing for ordered writes handling
*/
-static void blk_pre_flush_end_io(struct request *flush_rq, int error)
+inline unsigned blk_ordered_cur_seq(request_queue_t *q)
{
- struct request *rq = flush_rq->end_io_data;
- request_queue_t *q = rq->q;
+ if (!q->ordseq)
+ return 0;
+ return 1 << ffz(q->ordseq);
+}
- elv_completed_request(q, flush_rq);
+unsigned blk_ordered_req_seq(struct request *rq)
+{
+ request_queue_t *q = rq->q;
- rq->flags |= REQ_BAR_PREFLUSH;
+ BUG_ON(q->ordseq == 0);
- if (!flush_rq->errors)
- elv_requeue_request(q, rq);
- else {
- q->end_flush_fn(q, flush_rq);
- clear_bit(QUEUE_FLAG_FLUSH, &q->queue_flags);
- q->request_fn(q);
- }
+ if (rq == q->pre_flush_rq)
+ return QUEUE_ORDSEQ_PREFLUSH;
+ if (rq == q->bar_rq)
+ return QUEUE_ORDSEQ_BAR;
+ if (rq == q->post_flush_rq)
+ return QUEUE_ORDSEQ_POSTFLUSH;
+
+ if ((rq->flags & REQ_ORDERED_COLOR) ==
+ (q->orig_bar_rq->flags & REQ_ORDERED_COLOR))
+ return QUEUE_ORDSEQ_DRAIN;
+ else
+ return QUEUE_ORDSEQ_DONE;
}
-static void blk_post_flush_end_io(struct request *flush_rq, int error)
+void blk_ordered_complete_seq(request_queue_t *q, unsigned seq, int error)
{
- struct request *rq = flush_rq->end_io_data;
- request_queue_t *q = rq->q;
+ struct request *rq;
+ int uptodate;
- elv_completed_request(q, flush_rq);
+ if (error && !q->orderr)
+ q->orderr = error;
- rq->flags |= REQ_BAR_POSTFLUSH;
+ BUG_ON(q->ordseq & seq);
+ q->ordseq |= seq;
- q->end_flush_fn(q, flush_rq);
- clear_bit(QUEUE_FLAG_FLUSH, &q->queue_flags);
- q->request_fn(q);
+ if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE)
+ return;
+
+ /*
+ * Okay, sequence complete.
+ */
+ rq = q->orig_bar_rq;
+ uptodate = q->orderr ? q->orderr : 1;
+
+ q->ordseq = 0;
+
+ end_that_request_first(rq, uptodate, rq->hard_nr_sectors);
+ end_that_request_last(rq, uptodate);
}
-struct request *blk_start_pre_flush(request_queue_t *q, struct request *rq)
+static void pre_flush_end_io(struct request *rq, int error)
{
- struct request *flush_rq = q->flush_rq;
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
+}
- BUG_ON(!blk_barrier_rq(rq));
+static void bar_end_io(struct request *rq, int error)
+{
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
+}
- if (test_and_set_bit(QUEUE_FLAG_FLUSH, &q->queue_flags))
- return NULL;
+static void post_flush_end_io(struct request *rq, int error)
+{
+ elv_completed_request(rq->q, rq);
+ blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
+}
- rq_init(q, flush_rq);
- flush_rq->elevator_private = NULL;
- flush_rq->flags = REQ_BAR_FLUSH;
- flush_rq->rq_disk = rq->rq_disk;
- flush_rq->rl = NULL;
+static void queue_flush(request_queue_t *q, unsigned which)
+{
+ struct request *rq;
+ rq_end_io_fn *end_io;
+
+ if (which == QUEUE_ORDERED_PREFLUSH) {
+ rq = q->pre_flush_rq;
+ end_io = pre_flush_end_io;
+ } else {
+ rq = q->post_flush_rq;
+ end_io = post_flush_end_io;
+ }
+
+ rq_init(q, rq);
+ rq->flags = REQ_HARDBARRIER;
+ rq->elevator_private = NULL;
+ rq->rq_disk = q->bar_rq->rq_disk;
+ rq->rl = NULL;
+ rq->end_io = end_io;
+ q->prepare_flush_fn(q, rq);
+
+ __elv_add_request(q, rq, ELEVATOR_INSERT_FRONT, 0);
+}
+
+static inline struct request *start_ordered(request_queue_t *q,
+ struct request *rq)
+{
+ q->bi_size = 0;
+ q->orderr = 0;
+ q->ordseq |= QUEUE_ORDSEQ_STARTED;
/*
- * prepare_flush returns 0 if no flush is needed, just mark both
- * pre and post flush as done in that case
+ * Prep proxy barrier request.
*/
- if (!q->prepare_flush_fn(q, flush_rq)) {
- rq->flags |= REQ_BAR_PREFLUSH | REQ_BAR_POSTFLUSH;
- clear_bit(QUEUE_FLAG_FLUSH, &q->queue_flags);
- return rq;
- }
+ blkdev_dequeue_request(rq);
+ q->orig_bar_rq = rq;
+ rq = q->bar_rq;
+ rq_init(q, rq);
+ rq->flags = bio_data_dir(q->orig_bar_rq->bio);
+ rq->flags |= q->ordered & QUEUE_ORDERED_FUA ? REQ_FUA : 0;
+ rq->elevator_private = NULL;
+ rq->rl = NULL;
+ init_request_from_bio(rq, q->orig_bar_rq->bio);
+ rq->end_io = bar_end_io;
/*
- * some drivers dequeue requests right away, some only after io
- * completion. make sure the request is dequeued.
+ * Queue ordered sequence. As we stack them at the head, we
+ * need to queue in reverse order. Note that we rely on that
+ * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs
+ * request gets inbetween ordered sequence.
*/
- if (!list_empty(&rq->queuelist))
- blkdev_dequeue_request(rq);
+ if (q->ordered & QUEUE_ORDERED_POSTFLUSH)
+ queue_flush(q, QUEUE_ORDERED_POSTFLUSH);
+ else
+ q->ordseq |= QUEUE_ORDSEQ_POSTFLUSH;
+
+ __elv_add_request(q, rq, ELEVATOR_INSERT_FRONT, 0);
- flush_rq->end_io_data = rq;
- flush_rq->end_io = blk_pre_flush_end_io;
+ if (q->ordered & QUEUE_ORDERED_PREFLUSH) {
+ queue_flush(q, QUEUE_ORDERED_PREFLUSH);
+ rq = q->pre_flush_rq;
+ } else
+ q->ordseq |= QUEUE_ORDSEQ_PREFLUSH;
+
+ if ((q->ordered & QUEUE_ORDERED_TAG) || q->in_flight == 0)
+ q->ordseq |= QUEUE_ORDSEQ_DRAIN;
+ else
+ rq = NULL;
- __elv_add_request(q, flush_rq, ELEVATOR_INSERT_FRONT, 0);
- return flush_rq;
+ return rq;
}
-static void blk_start_post_flush(request_queue_t *q, struct request *rq)
+int blk_do_ordered(request_queue_t *q, struct request **rqp)
{
- struct request *flush_rq = q->flush_rq;
+ struct request *rq = *rqp, *allowed_rq;
+ int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq);
- BUG_ON(!blk_barrier_rq(rq));
+ if (!q->ordseq) {
+ if (!is_barrier)
+ return 1;
+
+ if (q->ordered != QUEUE_ORDERED_NONE) {
+ *rqp = start_ordered(q, rq);
+ return 1;
+ } else {
+ /*
+ * This can happen when the queue switches to
+ * ORDERED_NONE while this request is on it.
+ */
+ blkdev_dequeue_request(rq);
+ end_that_request_first(rq, -EOPNOTSUPP,
+ rq->hard_nr_sectors);
+ end_that_request_last(rq, -EOPNOTSUPP);
+ *rqp = NULL;
+ return 0;
+ }
+ }
- rq_init(q, flush_rq);
- flush_rq->elevator_private = NULL;
- flush_rq->flags = REQ_BAR_FLUSH;
- flush_rq->rq_disk = rq->rq_disk;
- flush_rq->rl = NULL;
+ if (q->ordered & QUEUE_ORDERED_TAG) {
+ if (is_barrier && rq != q->bar_rq)
+ *rqp = NULL;
+ return 1;
+ }
- if (q->prepare_flush_fn(q, flush_rq)) {
- flush_rq->end_io_data = rq;
- flush_rq->end_io = blk_post_flush_end_io;
+ switch (blk_ordered_cur_seq(q)) {
+ case QUEUE_ORDSEQ_PREFLUSH:
+ allowed_rq = q->pre_flush_rq;
+ break;
+ case QUEUE_ORDSEQ_BAR:
+ allowed_rq = q->bar_rq;
+ break;
+ case QUEUE_ORDSEQ_POSTFLUSH:
+ allowed_rq = q->post_flush_rq;
+ break;
+ default:
+ allowed_rq = NULL;
+ break;
+ }
+
+ if (rq != allowed_rq && (blk_fs_request(rq) || rq == q->pre_flush_rq ||
+ rq == q->post_flush_rq))
+ *rqp = NULL;
- __elv_add_request(q, flush_rq, ELEVATOR_INSERT_FRONT, 0);
- q->request_fn(q);
- }
+ return 1;
}
-static inline int blk_check_end_barrier(request_queue_t *q, struct request *rq,
- int sectors)
+static int flush_dry_bio_endio(struct bio *bio, unsigned int bytes, int error)
{
- if (sectors > rq->nr_sectors)
- sectors = rq->nr_sectors;
+ request_queue_t *q = bio->bi_private;
+ struct bio_vec *bvec;
+ int i;
+
+ /*
+ * This is dry run, restore bio_sector and size. We'll finish
+ * this request again with the original bi_end_io after an
+ * error occurs or post flush is complete.
+ */
+ q->bi_size += bytes;
+
+ if (bio->bi_size)
+ return 1;
+
+ /* Rewind bvec's */
+ bio->bi_idx = 0;
+ bio_for_each_segment(bvec, bio, i) {
+ bvec->bv_len += bvec->bv_offset;
+ bvec->bv_offset = 0;
+ }
+
+ /* Reset bio */
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio->bi_size = q->bi_size;
+ bio->bi_sector -= (q->bi_size >> 9);
+ q->bi_size = 0;
- rq->nr_sectors -= sectors;
- return rq->nr_sectors;
+ return 0;
}
-static int __blk_complete_barrier_rq(request_queue_t *q, struct request *rq,
- int sectors, int queue_locked)
+static inline int ordered_bio_endio(struct request *rq, struct bio *bio,
+ unsigned int nbytes, int error)
{
- if (q->ordered != QUEUE_ORDERED_FLUSH)
- return 0;
- if (!blk_fs_request(rq) || !blk_barrier_rq(rq))
- return 0;
- if (blk_barrier_postflush(rq))
+ request_queue_t *q = rq->q;
+ bio_end_io_t *endio;
+ void *private;
+
+ if (q->bar_rq != rq)
return 0;
- if (!blk_check_end_barrier(q, rq, sectors)) {
- unsigned long flags = 0;
+ /*
+ * Okay, this is the barrier request in progress, dry finish it.
+ */
+ if (error && !q->orderr)
+ q->orderr = error;
- if (!queue_locked)
- spin_lock_irqsave(q->queue_lock, flags);
+ endio = bio->bi_end_io;
+ private = bio->bi_private;
+ bio->bi_end_io = flush_dry_bio_endio;
+ bio->bi_private = q;
- blk_start_post_flush(q, rq);
+ bio_endio(bio, nbytes, error);
- if (!queue_locked)
- spin_unlock_irqrestore(q->queue_lock, flags);
- }
+ bio->bi_end_io = endio;
+ bio->bi_private = private;
return 1;
}
/**
- * blk_complete_barrier_rq - complete possible barrier request
- * @q: the request queue for the device
- * @rq: the request
- * @sectors: number of sectors to complete
- *
- * Description:
- * Used in driver end_io handling to determine whether to postpone
- * completion of a barrier request until a post flush has been done. This
- * is the unlocked variant, used if the caller doesn't already hold the
- * queue lock.
- **/
-int blk_complete_barrier_rq(request_queue_t *q, struct request *rq, int sectors)
-{
- return __blk_complete_barrier_rq(q, rq, sectors, 0);
-}
-EXPORT_SYMBOL(blk_complete_barrier_rq);
-
-/**
- * blk_complete_barrier_rq_locked - complete possible barrier request
- * @q: the request queue for the device
- * @rq: the request
- * @sectors: number of sectors to complete
- *
- * Description:
- * See blk_complete_barrier_rq(). This variant must be used if the caller
- * holds the queue lock.
- **/
-int blk_complete_barrier_rq_locked(request_queue_t *q, struct request *rq,
- int sectors)
-{
- return __blk_complete_barrier_rq(q, rq, sectors, 1);
-}
-EXPORT_SYMBOL(blk_complete_barrier_rq_locked);
-
-/**
* blk_queue_bounce_limit - set bounce buffer limit for queue
* @q: the request queue for the device
* @dma_addr: bus address limit
@@ -1044,6 +1226,7 @@ static char *rq_flags[] = {
"REQ_SORTED",
"REQ_SOFTBARRIER",
"REQ_HARDBARRIER",
+ "REQ_FUA",
"REQ_CMD",
"REQ_NOMERGE",
"REQ_STARTED",
@@ -1063,6 +1246,7 @@ static char *rq_flags[] = {
"REQ_PM_SUSPEND",
"REQ_PM_RESUME",
"REQ_PM_SHUTDOWN",
+ "REQ_ORDERED_COLOR",
};
void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -1627,7 +1811,7 @@ void blk_cleanup_queue(request_queue_t *
if (q->queue_tags)
__blk_queue_free_tags(q);
- blk_queue_ordered(q, QUEUE_ORDERED_NONE);
+ blk_queue_ordered(q, QUEUE_ORDERED_NONE, NULL, 0);
kmem_cache_free(requestq_cachep, q);
}
@@ -3055,7 +3239,8 @@ static int __end_that_request_first(stru
if (nr_bytes >= bio->bi_size) {
req->bio = bio->bi_next;
nbytes = bio->bi_size;
- bio_endio(bio, nbytes, error);
+ if (!ordered_bio_endio(req, bio, nbytes, error))
+ bio_endio(bio, nbytes, error);
next_idx = 0;
bio_nbytes = 0;
} else {
@@ -3110,7 +3295,8 @@ static int __end_that_request_first(stru
* if the request wasn't completed, update state
*/
if (bio_nbytes) {
- bio_endio(bio, bio_nbytes, error);
+ if (!ordered_bio_endio(req, bio, bio_nbytes, error))
+ bio_endio(bio, bio_nbytes, error);
bio->bi_idx += next_idx;
bio_iovec(bio)->bv_offset += nr_bytes;
bio_iovec(bio)->bv_len -= nr_bytes;
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h 2005-11-18 00:35:04.000000000 +0900
+++ work/include/linux/blkdev.h 2005-11-18 00:35:05.000000000 +0900
@@ -206,6 +206,7 @@ enum rq_flag_bits {
__REQ_SORTED, /* elevator knows about this request */
__REQ_SOFTBARRIER, /* may not be passed by ioscheduler */
__REQ_HARDBARRIER, /* may not be passed by drive either */
+ __REQ_FUA, /* forced unit access */
__REQ_CMD, /* is a regular fs rw request */
__REQ_NOMERGE, /* don't touch this for merging */
__REQ_STARTED, /* drive already may have started this one */
@@ -229,9 +230,7 @@ enum rq_flag_bits {
__REQ_PM_SUSPEND, /* suspend request */
__REQ_PM_RESUME, /* resume request */
__REQ_PM_SHUTDOWN, /* shutdown request */
- __REQ_BAR_PREFLUSH, /* barrier pre-flush done */
- __REQ_BAR_POSTFLUSH, /* barrier post-flush */
- __REQ_BAR_FLUSH, /* rq is the flush request */
+ __REQ_ORDERED_COLOR, /* is before or after barrier */
__REQ_NR_BITS, /* stops here */
};
@@ -240,6 +239,7 @@ enum rq_flag_bits {
#define REQ_SORTED (1 << __REQ_SORTED)
#define REQ_SOFTBARRIER (1 << __REQ_SOFTBARRIER)
#define REQ_HARDBARRIER (1 << __REQ_HARDBARRIER)
+#define REQ_FUA (1 << __REQ_FUA)
#define REQ_CMD (1 << __REQ_CMD)
#define REQ_NOMERGE (1 << __REQ_NOMERGE)
#define REQ_STARTED (1 << __REQ_STARTED)
@@ -259,9 +259,7 @@ enum rq_flag_bits {
#define REQ_PM_SUSPEND (1 << __REQ_PM_SUSPEND)
#define REQ_PM_RESUME (1 << __REQ_PM_RESUME)
#define REQ_PM_SHUTDOWN (1 << __REQ_PM_SHUTDOWN)
-#define REQ_BAR_PREFLUSH (1 << __REQ_BAR_PREFLUSH)
-#define REQ_BAR_POSTFLUSH (1 << __REQ_BAR_POSTFLUSH)
-#define REQ_BAR_FLUSH (1 << __REQ_BAR_FLUSH)
+#define REQ_ORDERED_COLOR (1 << __REQ_ORDERED_COLOR)
/*
* State information carried for REQ_PM_SUSPEND and REQ_PM_RESUME
@@ -291,8 +289,7 @@ struct bio_vec;
typedef int (merge_bvec_fn) (request_queue_t *, struct bio *, struct bio_vec *);
typedef void (activity_fn) (void *data, int rw);
typedef int (issue_flush_fn) (request_queue_t *, struct gendisk *, sector_t *);
-typedef int (prepare_flush_fn) (request_queue_t *, struct request *);
-typedef void (end_flush_fn) (request_queue_t *, struct request *);
+typedef void (prepare_flush_fn) (request_queue_t *, struct request *);
enum blk_queue_state {
Queue_down,
@@ -334,7 +331,6 @@ struct request_queue
activity_fn *activity_fn;
issue_flush_fn *issue_flush_fn;
prepare_flush_fn *prepare_flush_fn;
- end_flush_fn *end_flush_fn;
/*
* Dispatch queue sorting
@@ -418,14 +414,11 @@ struct request_queue
/*
* reserved for flush operations
*/
- struct request *flush_rq;
- unsigned char ordered;
-};
-
-enum {
- QUEUE_ORDERED_NONE,
- QUEUE_ORDERED_TAG,
- QUEUE_ORDERED_FLUSH,
+ unsigned int ordered, ordseq;
+ int orderr, ordcolor;
+ struct request *pre_flush_rq, *bar_rq, *post_flush_rq;
+ struct request *orig_bar_rq;
+ unsigned int bi_size;
};
#define RQ_INACTIVE (-1)
@@ -443,12 +436,51 @@ enum {
#define QUEUE_FLAG_REENTER 6 /* Re-entrancy avoidance */
#define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */
#define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
-#define QUEUE_FLAG_FLUSH 9 /* doing barrier flush sequence */
+
+enum {
+ /*
+ * Hardbarrier is supported with one of the following methods.
+ *
+ * NONE : hardbarrier unsupported
+ * DRAIN : ordering by draining is enough
+ * DRAIN_FLUSH : ordering by draining w/ pre and post flushes
+ * DRAIN_FUA : ordering by draining w/ pre flush and FUA write
+ * TAG : ordering by tag is enough
+ * TAG_FLUSH : ordering by tag w/ pre and post flushes
+ * TAG_FUA : ordering by tag w/ pre flush and FUA write
+ */
+ QUEUE_ORDERED_NONE = 0x00,
+ QUEUE_ORDERED_DRAIN = 0x01,
+ QUEUE_ORDERED_TAG = 0x02,
+
+ QUEUE_ORDERED_PREFLUSH = 0x10,
+ QUEUE_ORDERED_POSTFLUSH = 0x20,
+ QUEUE_ORDERED_FUA = 0x40,
+
+ QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN |
+ QUEUE_ORDERED_PREFLUSH | QUEUE_ORDERED_POSTFLUSH,
+ QUEUE_ORDERED_DRAIN_FUA = QUEUE_ORDERED_DRAIN |
+ QUEUE_ORDERED_PREFLUSH | QUEUE_ORDERED_FUA,
+ QUEUE_ORDERED_TAG_FLUSH = QUEUE_ORDERED_TAG |
+ QUEUE_ORDERED_PREFLUSH | QUEUE_ORDERED_POSTFLUSH,
+ QUEUE_ORDERED_TAG_FUA = QUEUE_ORDERED_TAG |
+ QUEUE_ORDERED_PREFLUSH | QUEUE_ORDERED_FUA,
+
+ /*
+ * Ordered operation sequence
+ */
+ QUEUE_ORDSEQ_STARTED = 0x01, /* flushing in progress */
+ QUEUE_ORDSEQ_DRAIN = 0x02, /* waiting for the queue to be drained */
+ QUEUE_ORDSEQ_PREFLUSH = 0x04, /* pre-flushing in progress */
+ QUEUE_ORDSEQ_BAR = 0x08, /* original barrier req in progress */
+ QUEUE_ORDSEQ_POSTFLUSH = 0x10, /* post-flushing in progress */
+ QUEUE_ORDSEQ_DONE = 0x20,
+};
#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
#define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
-#define blk_queue_flushing(q) test_bit(QUEUE_FLAG_FLUSH, &(q)->queue_flags)
+#define blk_queue_flushing(q) ((q)->ordseq)
#define blk_fs_request(rq) ((rq)->flags & REQ_CMD)
#define blk_pc_request(rq) ((rq)->flags & REQ_BLOCK_PC)
@@ -464,8 +496,7 @@ enum {
#define blk_sorted_rq(rq) ((rq)->flags & REQ_SORTED)
#define blk_barrier_rq(rq) ((rq)->flags & REQ_HARDBARRIER)
-#define blk_barrier_preflush(rq) ((rq)->flags & REQ_BAR_PREFLUSH)
-#define blk_barrier_postflush(rq) ((rq)->flags & REQ_BAR_POSTFLUSH)
+#define blk_fua_rq(rq) ((rq)->flags & REQ_FUA)
#define list_entry_rq(ptr) list_entry((ptr), struct request, queuelist)
@@ -657,11 +688,13 @@ extern void blk_queue_prep_rq(request_qu
extern void blk_queue_merge_bvec(request_queue_t *, merge_bvec_fn *);
extern void blk_queue_dma_alignment(request_queue_t *, int);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern void blk_queue_ordered(request_queue_t *, int);
+extern int blk_queue_ordered(request_queue_t *, unsigned, prepare_flush_fn *, unsigned);
+extern int blk_queue_ordered_locked(request_queue_t *, unsigned, prepare_flush_fn *, unsigned);
extern void blk_queue_issue_flush_fn(request_queue_t *, issue_flush_fn *);
-extern struct request *blk_start_pre_flush(request_queue_t *,struct request *);
-extern int blk_complete_barrier_rq(request_queue_t *, struct request *, int);
-extern int blk_complete_barrier_rq_locked(request_queue_t *, struct request *, int);
+extern int blk_do_ordered(request_queue_t *, struct request **);
+extern unsigned blk_ordered_cur_seq(request_queue_t *);
+extern unsigned blk_ordered_req_seq(struct request *);
+extern void blk_ordered_complete_seq(request_queue_t *, unsigned, int);
extern int blk_rq_map_sg(request_queue_t *, struct request *, struct scatterlist *);
extern void blk_dump_rq_flags(struct request *, char *);
Index: work/include/linux/elevator.h
===================================================================
--- work.orig/include/linux/elevator.h 2005-11-18 00:14:29.000000000 +0900
+++ work/include/linux/elevator.h 2005-11-18 00:35:05.000000000 +0900
@@ -130,6 +130,7 @@ extern int elv_try_last_merge(request_qu
#define ELEVATOR_INSERT_FRONT 1
#define ELEVATOR_INSERT_BACK 2
#define ELEVATOR_INSERT_SORT 3
+#define ELEVATOR_INSERT_REQUEUE 4
/*
* return values from elevator_may_queue_fn
10_blk_add-barrier-doc.patch
I/O barrier documentation
Signed-off-by: Tejun Heo <[email protected]>
barrier.txt | 271 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
biodoc.txt | 10 --
2 files changed, 273 insertions(+), 8 deletions(-)
Index: work/Documentation/block/barrier.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ work/Documentation/block/barrier.txt 2005-11-18 00:35:07.000000000 +0900
@@ -0,0 +1,271 @@
+I/O Barriers
+============
+Tejun Heo <[email protected]>, July 22 2005
+
+I/O barrier requests are used to guarantee ordering around the barrier
+requests. Unless you're crazy enough to use disk drives for
+implementing synchronization constructs (wow, sounds interesting...),
+the ordering is meaningful only for write requests for things like
+journal checkpoints. All requests queued before a barrier request
+must be finished (made it to the physical medium) before the barrier
+request is started, and all requests queued after the barrier request
+must be started only after the barrier request is finished (again,
+made it to the physical medium).
+
+In other words, I/O barrier requests have the following two properties.
+
+1. Request ordering
+
+Requests cannot pass the barrier request. Preceding requests are
+processed before the barrier and following requests after.
+
+Depending on what features a drive supports, this can be done in one
+of the following three ways.
+
+i. For devices which have queue depth greater than 1 (TCQ devices) and
+support ordered tags, block layer can just issue the barrier as an
+ordered request and the lower level driver, controller and drive
+itself are responsible for making sure that the ordering constraint is
+met. Most modern SCSI controllers/drives should support this.
+
+NOTE: SCSI ordered tag isn't currently used due to limitation in the
+ SCSI midlayer, see the following random notes section.
+
+ii. For devices which have queue depth greater than 1 but don't
+support ordered tags, block layer ensures that the requests preceding
+a barrier request finish before issuing the barrier request. Also,
+it defers requests following the barrier until the barrier request is
+finished. Older SCSI controllers/drives and SATA drives fall in this
+category.
+
+iii. Devices which have queue depth of 1. This is a degenerate case
+of ii. Just keeping issue order suffices. Ancient SCSI
+controllers/drives and IDE drives are in this category.
+
+2. Forced flushing to physical medium
+
+Again, if you're not gonna do synchronization with disk drives (dang,
+it sounds even more appealing now!), the reason you use I/O barriers
+is mainly to protect filesystem integrity when a power failure or some
+other event abruptly stops the drive from operating and possibly makes
+the drive lose data in its cache. So, I/O barriers need to guarantee
+that requests actually get written to non-volatile medium in order.
+
+There are four cases,
+
+i. No write-back cache. Keeping requests ordered is enough.
+
+ii. Write-back cache but no flush operation. There's no way to
+guarantee physical-medium commit order. This kind of device can't do
+I/O barriers.
+
+iii. Write-back cache and flush operation but no FUA (forced unit
+access). We need two cache flushes - before and after the barrier
+request.
+
+iv. Write-back cache, flush operation and FUA. We still need one
+flush to make sure requests preceding a barrier are written to medium,
+but post-barrier flush can be avoided by using FUA write on the
+barrier itself.
+
+
+How to support barrier requests in drivers
+------------------------------------------
+
+All barrier handling is done inside the block layer proper. All a low
+level driver has to do is implement its prepare_flush_fn and call one of
+the following two functions to indicate what ordered mode it supports
+and how to prepare flush requests. Note that the term 'ordered' is
+used to indicate the whole sequence of performing barrier requests
+including draining and flushing.
+
+typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);
+
+int blk_queue_ordered(request_queue_t *q, unsigned ordered,
+ prepare_flush_fn *prepare_flush_fn,
+ unsigned gfp_mask);
+
+int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
+ prepare_flush_fn *prepare_flush_fn,
+ unsigned gfp_mask);
+
+The only difference between the two functions is whether or not the
+caller is holding q->queue_lock on entry. The latter expects the
+caller to be holding the lock.
+
+@q : the queue in question
+@ordered : the ordered mode the driver/device supports
+@prepare_flush_fn : this function should prepare @rq such that it
+ flushes cache to physical medium when executed
+@gfp_mask : gfp_mask used when allocating data structures
+ for ordered processing
+
+For example, SCSI disk driver's prepare_flush_fn looks like the
+following.
+
+static void sd_prepare_flush(request_queue_t *q, struct request *rq)
+{
+ memset(rq->cmd, 0, sizeof(rq->cmd));
+ rq->flags |= REQ_BLOCK_PC;
+ rq->timeout = SD_TIMEOUT;
+ rq->cmd[0] = SYNCHRONIZE_CACHE;
+}
+
+The following seven ordered modes are supported. The following table
+shows which mode should be used depending on what features a
+device/driver supports. In the leftmost column of the table, the
+QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
+
+The table is followed by description of each mode. Note that in the
+descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
+used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
+preceding step must be complete before proceeding to the next step.
+'->' indicates that the next step can start as soon as the previous
+step is issued.
+
+              write-back cache   ordered tag   flush   FUA
+-----------------------------------------------------------------------
+NONE                yes/no           N/A        no     N/A
+DRAIN                 no             no         N/A    N/A
+DRAIN_FLUSH          yes             no         yes    no
+DRAIN_FUA            yes             no         yes    yes
+TAG                   no             yes        N/A    N/A
+TAG_FLUSH            yes             yes        yes    no
+TAG_FUA              yes             yes        yes    yes
+
+
+QUEUE_ORDERED_NONE
+ I/O barriers are not needed and/or supported.
+
+ Sequence: N/A
+
+QUEUE_ORDERED_DRAIN
+ Requests are ordered by draining the request queue and cache
+ flushing isn't needed.
+
+ Sequence: drain => barrier
+
+QUEUE_ORDERED_DRAIN_FLUSH
+ Requests are ordered by draining the request queue and both
+ pre-barrier and post-barrier cache flushings are needed.
+
+ Sequence: drain => preflush => barrier => postflush
+
+QUEUE_ORDERED_DRAIN_FUA
+ Requests are ordered by draining the request queue and
+ pre-barrier cache flushing is needed. By using FUA on barrier
+ request, post-barrier flushing can be skipped.
+
+ Sequence: drain => preflush => barrier
+
+QUEUE_ORDERED_TAG
+ Requests are ordered by ordered tag and cache flushing isn't
+ needed.
+
+ Sequence: barrier
+
+QUEUE_ORDERED_TAG_FLUSH
+ Requests are ordered by ordered tag and both pre-barrier and
+ post-barrier cache flushings are needed.
+
+ Sequence: preflush -> barrier -> postflush
+
+QUEUE_ORDERED_TAG_FUA
+ Requests are ordered by ordered tag and pre-barrier cache
+ flushing is needed. By using FUA on barrier request,
+ post-barrier flushing can be skipped.
+
+ Sequence: preflush -> barrier
+
+
+Random notes/caveats
+--------------------
+
+* SCSI layer currently can't use TAG ordering even if the drive,
+controller and driver support it. The problem is that SCSI midlayer
+request dispatch function is not atomic. It releases the queue lock and
+switches to the SCSI host lock during issue, so it's possible (and, over
+time, likely) that requests change their relative positions. Once
+this problem is solved, TAG ordering can be enabled.
+
+* Currently, no matter which ordered mode is used, there can be only
+one barrier request in progress. All I/O barriers are held off by
+block layer until the previous I/O barrier is complete. This doesn't
+make any difference for DRAIN ordered devices, but, for TAG ordered
+devices with very high command latency, passing multiple I/O barriers
+to low level *might* be helpful if they are very frequent. Well, this
+certainly is a non-issue. I'm writing this just to make clear that no
+two I/O barriers are ever passed to the low-level driver at the same time.
+
+* Completion order. Requests in ordered sequence are issued in order
+but not required to finish in order. Barrier implementation can
+handle out-of-order completion of ordered sequence. IOW, the requests
+MUST be processed in order but the hardware/software completion paths
+are allowed to reorder completion notifications - eg. current SCSI
+midlayer doesn't preserve completion order during error handling.
+
+* Requeueing order. Low-level drivers are free to requeue any request
+after they have removed it from the request queue with
+blkdev_dequeue_request(). As the barrier sequence should be kept in order
+when requeued, generic elevator code takes care of putting requests in
+order around barrier. See blk_ordered_req_seq() and
+ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
+
+Note that block drivers must not requeue preceding requests while
+completing later requests in an ordered sequence. Currently, no
+error checking is done against this.
+
+* Error handling. Currently, block layer will report error to upper
+layer if any of requests in an ordered sequence fails. Unfortunately,
+this doesn't seem to be enough. Look at the following request flow.
+QUEUE_ORDERED_TAG_FLUSH is in use.
+
+ [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
+                                           still in elevator
+
+Let's say requests [2] and [3] are write requests to update file system
+metadata (journal or whatever) and [barrier] is used to mark that
+those updates are valid. Consider the following sequence.
+
+ i. Requests [0] ~ [post] leave the request queue and enter the
+ low-level driver.
+ ii. After a while, unfortunately, something goes wrong and the
+ drive fails [2]. Note that any of [0], [1] and [3] could have
+ completed by this time, but [pre] couldn't have been finished
+ as the drive must process it in order and it failed before
+ processing that command.
+ iii. Error handling kicks in and determines that the error is
+ unrecoverable and fails [2], and resumes operation.
+ iv. [pre] [barrier] [post] gets processed.
+ v. *BOOM* power fails
+
+The problem here is that the barrier request is *supposed* to indicate
+that filesystem update requests [2] and [3] made it safely to the
+physical medium and, if the machine crashes after the barrier is
+written, filesystem recovery code can depend on that. Sadly, that
+isn't true in this case anymore. IOW, the success of an I/O barrier
+should also be dependent on success of some of the preceding requests,
+where only upper layer (filesystem) knows what 'some' is.
+
+This can be solved by implementing a way to tell the block layer which
+requests affect the success of the following barrier request and
+making lower level drivers resume operation on error only after the
+block layer tells them to do so.
+
+As the probability of this happening is very low and the drive would
+have to be faulty anyway, implementing the fix is probably overkill.
+But, still, it's there.
+
+* In previous drafts of the barrier implementation, there was a fallback
+mechanism such that, if FUA or ordered TAG failed, a less fancy ordered
+mode could be selected and the failed barrier request retried
+automatically. The rationale for this feature was that, as FUA is
+pretty new in the ATA world and ordered tag was never widely used, there
+could be devices which report support for those features but choke when
+actually given such requests.
+
+ This was removed for two reasons: 1. it's overkill, and 2. it's
+impossible to implement properly when TAG ordering is used, as low
+level drivers resume after an error automatically. If it's ever
+needed, adding it back and modifying low level drivers accordingly
+shouldn't be difficult.
Index: work/Documentation/block/biodoc.txt
===================================================================
--- work.orig/Documentation/block/biodoc.txt 2005-11-18 00:14:20.000000000 +0900
+++ work/Documentation/block/biodoc.txt 2005-11-18 00:35:07.000000000 +0900
@@ -263,14 +263,8 @@ A flag in the bio structure, BIO_BARRIER
The generic i/o scheduler would make sure that it places the barrier request and
all other requests coming after it after all the previous requests in the
queue. Barriers may be implemented in different ways depending on the
-driver. A SCSI driver for example could make use of ordered tags to
-preserve the necessary ordering with a lower impact on throughput. For IDE
-this might be two sync cache flush: a pre and post flush when encountering
-a barrier write.
-
-There is a provision for queues to indicate what kind of barriers they
-can provide. This is as of yet unmerged, details will be added here once it
-is in the kernel.
+driver. For more details regarding I/O barriers, please read barrier.txt
+in this directory.
1.2.2 Request Priority/Latency
08_blk_ide-update-ordered.patch
Update IDE to use new blk_ordered.
Signed-off-by: Tejun Heo <[email protected]>
ide-disk.c | 57 +++++++++------------------------------------------------
ide-io.c | 5 +----
2 files changed, 10 insertions(+), 52 deletions(-)
Index: work/drivers/ide/ide-disk.c
===================================================================
--- work.orig/drivers/ide/ide-disk.c 2005-11-18 00:14:21.000000000 +0900
+++ work/drivers/ide/ide-disk.c 2005-11-18 00:35:06.000000000 +0900
@@ -681,50 +681,9 @@ static ide_proc_entry_t idedisk_proc[] =
#endif /* CONFIG_PROC_FS */
-static void idedisk_end_flush(request_queue_t *q, struct request *flush_rq)
+static void idedisk_prepare_flush(request_queue_t *q, struct request *rq)
{
ide_drive_t *drive = q->queuedata;
- struct request *rq = flush_rq->end_io_data;
- int good_sectors = rq->hard_nr_sectors;
- int bad_sectors;
- sector_t sector;
-
- if (flush_rq->errors & ABRT_ERR) {
- printk(KERN_ERR "%s: barrier support doesn't work\n", drive->name);
- blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE);
- blk_queue_issue_flush_fn(drive->queue, NULL);
- good_sectors = 0;
- } else if (flush_rq->errors) {
- good_sectors = 0;
- if (blk_barrier_preflush(rq)) {
- sector = ide_get_error_location(drive,flush_rq->buffer);
- if ((sector >= rq->hard_sector) &&
- (sector < rq->hard_sector + rq->hard_nr_sectors))
- good_sectors = sector - rq->hard_sector;
- }
- }
-
- if (flush_rq->errors)
- printk(KERN_ERR "%s: failed barrier write: "
- "sector=%Lx(good=%d/bad=%d)\n",
- drive->name, (unsigned long long)rq->sector,
- good_sectors,
- (int) (rq->hard_nr_sectors-good_sectors));
-
- bad_sectors = rq->hard_nr_sectors - good_sectors;
-
- if (good_sectors)
- __ide_end_request(drive, rq, 1, good_sectors);
- if (bad_sectors)
- __ide_end_request(drive, rq, 0, bad_sectors);
-}
-
-static int idedisk_prepare_flush(request_queue_t *q, struct request *rq)
-{
- ide_drive_t *drive = q->queuedata;
-
- if (!drive->wcache)
- return 0;
memset(rq->cmd, 0, sizeof(rq->cmd));
@@ -735,9 +694,8 @@ static int idedisk_prepare_flush(request
rq->cmd[0] = WIN_FLUSH_CACHE;
- rq->flags |= REQ_DRIVE_TASK | REQ_SOFTBARRIER;
+ rq->flags |= REQ_DRIVE_TASK;
rq->buffer = rq->cmd;
- return 1;
}
static int idedisk_issue_flush(request_queue_t *q, struct gendisk *disk,
@@ -1012,11 +970,12 @@ static void idedisk_setup (ide_drive_t *
printk(KERN_INFO "%s: cache flushes %ssupported\n",
drive->name, barrier ? "" : "not ");
if (barrier) {
- blk_queue_ordered(drive->queue, QUEUE_ORDERED_FLUSH);
- drive->queue->prepare_flush_fn = idedisk_prepare_flush;
- drive->queue->end_flush_fn = idedisk_end_flush;
+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN_FLUSH,
+ idedisk_prepare_flush, GFP_KERNEL);
blk_queue_issue_flush_fn(drive->queue, idedisk_issue_flush);
- }
+ } else if (!drive->wcache)
+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN,
+ NULL, GFP_KERNEL);
}
static void ide_cacheflush_p(ide_drive_t *drive)
@@ -1034,6 +993,8 @@ static int ide_disk_remove(struct device
struct ide_disk_obj *idkp = drive->driver_data;
struct gendisk *g = idkp->disk;
+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE, NULL, 0);
+
ide_cacheflush_p(drive);
ide_unregister_subdriver(drive, idkp->driver);
Index: work/drivers/ide/ide-io.c
===================================================================
--- work.orig/drivers/ide/ide-io.c 2005-11-18 00:35:04.000000000 +0900
+++ work/drivers/ide/ide-io.c 2005-11-18 00:35:06.000000000 +0900
@@ -119,10 +119,7 @@ int ide_end_request (ide_drive_t *drive,
if (!nr_sectors)
nr_sectors = rq->hard_cur_sectors;
- if (blk_complete_barrier_rq_locked(drive->queue, rq, nr_sectors))
- ret = rq->nr_sectors != 0;
- else
- ret = __ide_end_request(drive, rq, uptodate, nr_sectors);
+ ret = __ide_end_request(drive, rq, uptodate, nr_sectors);
spin_unlock_irqrestore(&ide_lock, flags);
return ret;
Tejun Heo wrote:
> * libata has gone through some changes since last posting. libata
> updates are rewritten. Jeff, please review #07 and #08.
Look OK to me.
Jeff
Hi,
On 11/17/05, Tejun Heo <[email protected]> wrote:
> 08_blk_ide-update-ordered.patch
>
> Update IDE to use new blk_ordered.
>
> Signed-off-by: Tejun Heo <[email protected]>
>
> ide-disk.c | 57 +++++++++------------------------------------------------
> ide-io.c | 5 +----
> 2 files changed, 10 insertions(+), 52 deletions(-)
>
> Index: work/drivers/ide/ide-disk.c
> ===================================================================
> --- work.orig/drivers/ide/ide-disk.c 2005-11-18 00:14:21.000000000 +0900
> +++ work/drivers/ide/ide-disk.c 2005-11-18 00:35:06.000000000 +0900
> @@ -681,50 +681,9 @@ static ide_proc_entry_t idedisk_proc[] =
>
> #endif /* CONFIG_PROC_FS */
>
> -static void idedisk_end_flush(request_queue_t *q, struct request *flush_rq)
> +static void idedisk_prepare_flush(request_queue_t *q, struct request *rq)
> {
> ide_drive_t *drive = q->queuedata;
> - struct request *rq = flush_rq->end_io_data;
> - int good_sectors = rq->hard_nr_sectors;
> - int bad_sectors;
> - sector_t sector;
> -
> - if (flush_rq->errors & ABRT_ERR) {
> - printk(KERN_ERR "%s: barrier support doesn't work\n", drive->name);
> - blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE);
> - blk_queue_issue_flush_fn(drive->queue, NULL);
> - good_sectors = 0;
> - } else if (flush_rq->errors) {
> - good_sectors = 0;
> - if (blk_barrier_preflush(rq)) {
> - sector = ide_get_error_location(drive,flush_rq->buffer);
> - if ((sector >= rq->hard_sector) &&
> - (sector < rq->hard_sector + rq->hard_nr_sectors))
> - good_sectors = sector - rq->hard_sector;
> - }
> - }
> -
> - if (flush_rq->errors)
> - printk(KERN_ERR "%s: failed barrier write: "
> - "sector=%Lx(good=%d/bad=%d)\n",
> - drive->name, (unsigned long long)rq->sector,
> - good_sectors,
> - (int) (rq->hard_nr_sectors-good_sectors));
> -
> - bad_sectors = rq->hard_nr_sectors - good_sectors;
> -
> - if (good_sectors)
> - __ide_end_request(drive, rq, 1, good_sectors);
> - if (bad_sectors)
> - __ide_end_request(drive, rq, 0, bad_sectors);
> -}
I fail to see how the partial completions (good + bad sectors)
are done in your new scheme, please explain.
> -
> -static int idedisk_prepare_flush(request_queue_t *q, struct request *rq)
> -{
> - ide_drive_t *drive = q->queuedata;
> -
> - if (!drive->wcache)
> - return 0;
What does happen if somebody disables drive->wcache later?
> memset(rq->cmd, 0, sizeof(rq->cmd));
>
> @@ -735,9 +694,8 @@ static int idedisk_prepare_flush(request
> rq->cmd[0] = WIN_FLUSH_CACHE;
>
>
> - rq->flags |= REQ_DRIVE_TASK | REQ_SOFTBARRIER;
> + rq->flags |= REQ_DRIVE_TASK;
> rq->buffer = rq->cmd;
> - return 1;
> }
>
> static int idedisk_issue_flush(request_queue_t *q, struct gendisk *disk,
> @@ -1012,11 +970,12 @@ static void idedisk_setup (ide_drive_t *
> printk(KERN_INFO "%s: cache flushes %ssupported\n",
> drive->name, barrier ? "" : "not ");
> if (barrier) {
> - blk_queue_ordered(drive->queue, QUEUE_ORDERED_FLUSH);
> - drive->queue->prepare_flush_fn = idedisk_prepare_flush;
> - drive->queue->end_flush_fn = idedisk_end_flush;
> + blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN_FLUSH,
> + idedisk_prepare_flush, GFP_KERNEL);
> blk_queue_issue_flush_fn(drive->queue, idedisk_issue_flush);
> - }
> + } else if (!drive->wcache)
> + blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN,
> + NULL, GFP_KERNEL);
What does happen if somebody enables drive->wcache later?
> }
>
> static void ide_cacheflush_p(ide_drive_t *drive)
> @@ -1034,6 +993,8 @@ static int ide_disk_remove(struct device
> struct ide_disk_obj *idkp = drive->driver_data;
> struct gendisk *g = idkp->disk;
>
> + blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE, NULL, 0);
> +
Shouldn't this be done in ide_disk_release()?
> ide_cacheflush_p(drive);
>
> ide_unregister_subdriver(drive, idkp->driver);
> Index: work/drivers/ide/ide-io.c
> ===================================================================
> --- work.orig/drivers/ide/ide-io.c 2005-11-18 00:35:04.000000000 +0900
> +++ work/drivers/ide/ide-io.c 2005-11-18 00:35:06.000000000 +0900
> @@ -119,10 +119,7 @@ int ide_end_request (ide_drive_t *drive,
> if (!nr_sectors)
> nr_sectors = rq->hard_cur_sectors;
>
> - if (blk_complete_barrier_rq_locked(drive->queue, rq, nr_sectors))
> - ret = rq->nr_sectors != 0;
> - else
> - ret = __ide_end_request(drive, rq, uptodate, nr_sectors);
> + ret = __ide_end_request(drive, rq, uptodate, nr_sectors);
>
> spin_unlock_irqrestore(&ide_lock, flags);
> return ret;
>
On 11/17/05, Tejun Heo <[email protected]> wrote:
> 09_blk_ide-add-fua-support.patch
>
> Add FUA support to IDE
>
> Signed-off-by: Tejun Heo <[email protected]>
>
> drivers/ide/ide-disk.c | 57 +++++++++++++++++++++++++++++++++++++++++--------
> include/linux/hdreg.h | 16 ++++++++++++-
> include/linux/ide.h | 3 ++
> 3 files changed, 65 insertions(+), 11 deletions(-)
>
> Index: work/drivers/ide/ide-disk.c
> ===================================================================
> --- work.orig/drivers/ide/ide-disk.c 2005-11-18 00:35:06.000000000 +0900
> +++ work/drivers/ide/ide-disk.c 2005-11-18 00:35:07.000000000 +0900
> @@ -164,13 +164,14 @@ static ide_startstop_t __ide_do_rw_disk(
> ide_hwif_t *hwif = HWIF(drive);
> unsigned int dma = drive->using_dma;
> u8 lba48 = (drive->addressing == 1) ? 1 : 0;
> + int fua = blk_fua_rq(rq);
> task_ioreg_t command = WIN_NOP;
> ata_nsector_t nsectors;
>
> nsectors.all = (u16) rq->nr_sectors;
>
> if (hwif->no_lba48_dma && lba48 && dma) {
> - if (block + rq->nr_sectors > 1ULL << 28)
> + if (block + rq->nr_sectors > 1ULL << 28 || fua)
> dma = 0;
> else
> lba48 = 0;
> @@ -226,6 +227,16 @@ static ide_startstop_t __ide_do_rw_disk(
> hwif->OUTB(tasklets[6], IDE_HCYL_REG);
> hwif->OUTB(0x00|drive->select.all,IDE_SELECT_REG);
> } else {
> + if (unlikely(fua)) {
> + /*
> + * This happens if LBA48 addressing is
> + * turned off during operation.
> + */
> + printk(KERN_ERR "%s: FUA write but LBA48 off\n",
> + drive->name);
> + goto fail;
> + }
> +
> hwif->OUTB(0x00, IDE_FEATURE_REG);
> hwif->OUTB(nsectors.b.low, IDE_NSECTOR_REG);
> hwif->OUTB(block, IDE_SECTOR_REG);
> @@ -253,9 +264,12 @@ static ide_startstop_t __ide_do_rw_disk(
> if (dma) {
> if (!hwif->dma_setup(drive)) {
> if (rq_data_dir(rq)) {
> - command = lba48 ? WIN_WRITEDMA_EXT : WIN_WRITEDMA;
> - if (drive->vdma)
> - command = lba48 ? WIN_WRITE_EXT: WIN_WRITE;
> + if (!fua) {
> + command = lba48 ? WIN_WRITEDMA_EXT : WIN_WRITEDMA;
> + if (drive->vdma)
> + command = lba48 ? WIN_WRITE_EXT: WIN_WRITE;
> + } else
> + command = ATA_CMD_WRITE_FUA_EXT;
What does happen for fua && drive->vdma case?
> } else {
> command = lba48 ? WIN_READDMA_EXT : WIN_READDMA;
> if (drive->vdma)
> @@ -284,8 +298,20 @@ static ide_startstop_t __ide_do_rw_disk(
> } else {
> if (drive->mult_count) {
> hwif->data_phase = TASKFILE_MULTI_OUT;
> - command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
> + if (!fua)
> + command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
> + else
> + command = ATA_CMD_WRITE_MULTI_FUA_EXT;
> } else {
> + if (unlikely(fua)) {
> + /*
> + * This happens if multisector PIO is
> + * turned off during operation.
> + */
> + printk(KERN_ERR "%s: FUA write but in single "
> + "sector PIO mode\n", drive->name);
> + goto fail;
> + }
Wouldn't it be better to do the following check at the beginning
of __ide_do_rw_disk() (after checking for dma vs lba48):
if (fua) {
if (!lba48 || ((!dma || drive->vdma) && !drive->mult_count))
goto fail_fua;
}
...
and fail the request if needed *before* actually touching any
hardware registers?
fail_fua:
printk(KERN_ERR "%s: FUA write unsupported (lba48=%u dma=%u"
" vdma=%u mult_count=%u)\n", drive->name,
lba48, dma, drive->vdma,
drive->mult_count);
ide_end_request(drive, 0, 0);
return ide_stopped;
> hwif->data_phase = TASKFILE_OUT;
> command = lba48 ? WIN_WRITE_EXT : WIN_WRITE;
> }
> @@ -295,6 +321,10 @@ static ide_startstop_t __ide_do_rw_disk(
>
> return pre_task_out_intr(drive, rq);
> }
> +
> + fail:
> + ide_end_request(drive, 0, 0);
> + return ide_stopped;
> }
>
> /*
> @@ -846,7 +876,7 @@ static void idedisk_setup (ide_drive_t *
> {
> struct hd_driveid *id = drive->id;
> unsigned long long capacity;
> - int barrier;
> + int barrier, fua;
>
> idedisk_add_settings(drive);
>
> @@ -967,10 +997,19 @@ static void idedisk_setup (ide_drive_t *
> barrier = 0;
> }
>
> - printk(KERN_INFO "%s: cache flushes %ssupported\n",
> - drive->name, barrier ? "" : "not ");
> + fua = barrier && idedisk_supports_lba48(id) && ide_id_has_fua(id);
> + /* When using PIO, FUA needs multisector. */
> + if ((!drive->using_dma || drive->hwif->no_lba48_dma) &&
> + drive->mult_count == 0)
> + fua = 0;
Shouldn't this check also for drive->vdma?
> +
> + printk(KERN_INFO "%s: cache flushes %ssupported%s\n",
> + drive->name, barrier ? "" : "not ",
> + fua ? " w/ FUA" : "");
> if (barrier) {
> - blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN_FLUSH,
> + unsigned ordered = fua ? QUEUE_ORDERED_DRAIN_FUA
> + : QUEUE_ORDERED_DRAIN_FLUSH;
> + blk_queue_ordered(drive->queue, ordered,
> idedisk_prepare_flush, GFP_KERNEL);
> blk_queue_issue_flush_fn(drive->queue, idedisk_issue_flush);
> } else if (!drive->wcache)
> Index: work/include/linux/hdreg.h
> ===================================================================
> --- work.orig/include/linux/hdreg.h 2005-11-18 00:06:46.000000000 +0900
> +++ work/include/linux/hdreg.h 2005-11-18 00:35:07.000000000 +0900
> @@ -550,7 +550,13 @@ struct hd_driveid {
> * cmd set-feature supported extensions
> * 15: Shall be ZERO
> * 14: Shall be ONE
> - * 13:6 reserved
> + * 13: IDLE IMMEDIATE w/ UNLOAD FEATURE
> + * 12:11 reserved for technical report
> + * 10: URG for WRITE STREAM
> + * 9: URG for READ STREAM
> + * 8: 64-bit World wide name
> + * 7: WRITE DMA QUEUED FUA EXT
> + * 6: WRITE DMA/MULTIPLE FUA EXT
> * 5: General Purpose Logging
> * 4: Streaming Feature Set
> * 3: Media Card Pass Through
> @@ -600,7 +606,13 @@ struct hd_driveid {
> * command set-feature default
> * 15: Shall be ZERO
> * 14: Shall be ONE
> - * 13:6 reserved
> + * 13: IDLE IMMEDIATE w/ UNLOAD FEATURE
> + * 12:11 reserved for technical report
> + * 10: URG for WRITE STREAM
> + * 9: URG for READ STREAM
> + * 8: 64-bit World wide name
> + * 7: WRITE DMA QUEUED FUA EXT
> + * 6: WRITE DMA/MULTIPLE FUA EXT
> * 5: General Purpose Logging enabled
> * 4: Valid CONFIGURE STREAM executed
> * 3: Media Card Pass Through enabled
> Index: work/include/linux/ide.h
> ===================================================================
> --- work.orig/include/linux/ide.h 2005-11-18 00:14:29.000000000 +0900
> +++ work/include/linux/ide.h 2005-11-18 00:35:07.000000000 +0900
> @@ -1503,6 +1503,9 @@ extern struct bus_type ide_bus_type;
> /* check if CACHE FLUSH (EXT) command is supported (bits defined in ATA-6) */
> #define ide_id_has_flush_cache(id) ((id)->cfs_enable_2 & 0x3000)
>
> +/* check if WRITE DMA FUA EXT command is supported (defined in ATA-8) */
> +#define ide_id_has_fua(id) ((id)->cfsse & 0x0040)
> +
> /* some Maxtor disks have bit 13 defined incorrectly so check bit 10 too */
> #define ide_id_has_flush_cache_ext(id) \
> (((id)->cfs_enable_2 & 0x2400) == 0x2400)
>
>
Hello, Bartlomiej.
Bartlomiej Zolnierkiewicz wrote:
> On 11/17/05, Tejun Heo <[email protected]> wrote:
>
> I fail to see how the partial completions (good + bad sectors)
> are done in your new scheme, please explain.
>
It doesn't. I've noted this way back when I posted this patchset the
second time.
http://marc.theaimsgroup.com/?l=linux-kernel&m=111795127124020&w=2
Rationales
* The actual barrier IO request is issued as a part of ordered sequence.
When any part of this sequence fails (any of leading flush, barrier IO
or post flush), the whole sequence should be considered to have failed.
e.g. if leading flush fails, there's no point in reporting partial or
full success of barrier IO. Ditto for the trailing flush. We can special
case when only part of barrier IO fails and report partial barrier
success, but 1. benefits are doubtful 2. even if it's implemented, it
wouldn't work (see next rationale)
* Barrier requests are not mergeable. ie. Each barrier bio is turned
into one barrier request and partially completing the request doesn't
result in any successfully completed bio.
* SCSI doesn't handle partial completion of barrier IOs.
>
>>-
>>-static int idedisk_prepare_flush(request_queue_t *q, struct request *rq)
>>-{
>>- ide_drive_t *drive = q->queuedata;
>>-
>>- if (!drive->wcache)
>>- return 0;
>
>
> What does happen if somebody disables drive->wcache later?
>
Thanks for pointing out. I've moved ordered configuration into
write_cache such that ordered is reconfigured when write_cache changes.
There can be in-flight barrier requests which are inconsistent with the
newly updated setting, but 1. it's not too unfair to assume that the user
is responsible for that synchronization, 2. the original implementation
had the same issue, and 3. the consequence is not catastrophic.
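Just to make it concrete, the reconfiguration boils down to something
like the following (untested sketch; the exact spot inside write_cache(),
the surrounding error handling and the GFP flag are my guesses, and FUA
selection is left out for brevity):

static int write_cache(ide_drive_t *drive, int arg)
{
	int err = 0;

	/* ... existing code issuing SETFEATURES to toggle the write cache ... */

	/*
	 * Keep the ordered mode in sync with the cache setting:
	 * flushes are only needed while the write cache is enabled.
	 * -ENOMEM from blk_queue_ordered() is ignored in this sketch.
	 */
	if (arg)
		blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN_FLUSH,
				  idedisk_prepare_flush, GFP_KERNEL);
	else
		blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN,
				  NULL, GFP_KERNEL);

	drive->wcache = arg;
	return err;
}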
>
>> memset(rq->cmd, 0, sizeof(rq->cmd));
>>
>>@@ -735,9 +694,8 @@ static int idedisk_prepare_flush(request
>> rq->cmd[0] = WIN_FLUSH_CACHE;
>>
>>
>>- rq->flags |= REQ_DRIVE_TASK | REQ_SOFTBARRIER;
>>+ rq->flags |= REQ_DRIVE_TASK;
>> rq->buffer = rq->cmd;
>>- return 1;
>> }
>>
>> static int idedisk_issue_flush(request_queue_t *q, struct gendisk *disk,
>>@@ -1012,11 +970,12 @@ static void idedisk_setup (ide_drive_t *
>> printk(KERN_INFO "%s: cache flushes %ssupported\n",
>> drive->name, barrier ? "" : "not ");
>> if (barrier) {
>>- blk_queue_ordered(drive->queue, QUEUE_ORDERED_FLUSH);
>>- drive->queue->prepare_flush_fn = idedisk_prepare_flush;
>>- drive->queue->end_flush_fn = idedisk_end_flush;
>>+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN_FLUSH,
>>+ idedisk_prepare_flush, GFP_KERNEL);
>> blk_queue_issue_flush_fn(drive->queue, idedisk_issue_flush);
>>- }
>>+ } else if (!drive->wcache)
>>+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN,
>>+ NULL, GFP_KERNEL);
>
>
> What does happen if somebody enables drive->wcache later?
>
ditto.
>
>> }
>>
>> static void ide_cacheflush_p(ide_drive_t *drive)
>>@@ -1034,6 +993,8 @@ static int ide_disk_remove(struct device
>> struct ide_disk_obj *idkp = drive->driver_data;
>> struct gendisk *g = idkp->disk;
>>
>>+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE, NULL, 0);
>>+
>
>
> Shouldn't this be done in ide_disk_release()?
Hmmm... The thing is that, AFAIK, requests are not supposed to be issued
after ->remove is called (->remove is called only on module unload
unless the hardware is hot-unplugged, and the HL driver cannot be
unloaded while it's still open). I think that's why both sd and ide-disk issue the
last cache flush in ->remove callbacks but not in ->release.
I think the most natural place to put ordered deconfiguration is right
above the last cache flush. Hmmm... If the above is not true, I think we
should move both ordered deconfig and the last cache flushes to
->release callbacks. What do you think?
Thanks.
--
tejun
Hi, Bartlomiej.
Bartlomiej Zolnierkiewicz wrote:
> On 11/17/05, Tejun Heo <[email protected]> wrote:
>
> What does happen for fua && drive->vdma case?
>
Thanks for pointing out, wasn't thinking about that. Hmmm... When using
vdma, single-sector PIO commands are issued instead but there's no
single-sector FUA PIO command. Would issuing
ATA_CMD_WRITE_MULTI_FUA_EXT instead of ATA_CMD_WRITE_FUA_EXT work? Or
should I just disable FUA in the vdma case?
>
>> } else {
>> command = lba48 ? WIN_READDMA_EXT : WIN_READDMA;
>> if (drive->vdma)
>>@@ -284,8 +298,20 @@ static ide_startstop_t __ide_do_rw_disk(
>> } else {
>> if (drive->mult_count) {
>> hwif->data_phase = TASKFILE_MULTI_OUT;
>>- command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
>>+ if (!fua)
>>+ command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
>>+ else
>>+ command = ATA_CMD_WRITE_MULTI_FUA_EXT;
>> } else {
>>+ if (unlikely(fua)) {
>>+ /*
>>+ * This happens if multisector PIO is
>>+ * turned off during operation.
>>+ */
>>+ printk(KERN_ERR "%s: FUA write but in single "
>>+ "sector PIO mode\n", drive->name);
>>+ goto fail;
>>+ }
>
>
> Wouldn't it be better to do the following check at the beginning
> of __ide_do_rw_disk() (after checking for dma vs lba48):
>
> if (fua) {
> if (!lba48 || ((!dma || drive->vdma) && !drive->mult_count))
> goto fail_fua;
> }
>
> ...
>
> and fail the request if needed *before* actually touching any
> hardware registers?
>
> fail_fua:
> printk(KERN_ERR "%s: FUA write unsupported (lba48=%u dma=%u"
> " vdma=%u mult_count=%u)\n", drive->name,
> lba48, dma, drive->vdma,
> drive->mult_count);
> ide_end_request(drive, 0, 0);
> return ide_stopped;
>
Hmmm... The thing is that those failure cases will happen extremely
rarely if at all. Remember this post?
http://marc.theaimsgroup.com/?l=linux-kernel&m=111798102108338&w=3
It's mostly guaranteed that those failure cases don't occur, so I
thought avoiding IO in the failure case wasn't that helpful.
>
>> hwif->data_phase = TASKFILE_OUT;
>> command = lba48 ? WIN_WRITE_EXT : WIN_WRITE;
>> }
>>@@ -295,6 +321,10 @@ static ide_startstop_t __ide_do_rw_disk(
>>
>> return pre_task_out_intr(drive, rq);
>> }
>>+
>>+ fail:
>>+ ide_end_request(drive, 0, 0);
>>+ return ide_stopped;
>> }
>>
>> /*
>>@@ -846,7 +876,7 @@ static void idedisk_setup (ide_drive_t *
>> {
>> struct hd_driveid *id = drive->id;
>> unsigned long long capacity;
>>- int barrier;
>>+ int barrier, fua;
>>
>> idedisk_add_settings(drive);
>>
>>@@ -967,10 +997,19 @@ static void idedisk_setup (ide_drive_t *
>> barrier = 0;
>> }
>>
>>- printk(KERN_INFO "%s: cache flushes %ssupported\n",
>>- drive->name, barrier ? "" : "not ");
>>+ fua = barrier && idedisk_supports_lba48(id) && ide_id_has_fua(id);
>>+ /* When using PIO, FUA needs multisector. */
>>+ if ((!drive->using_dma || drive->hwif->no_lba48_dma) &&
>>+ drive->mult_count == 0)
>>+ fua = 0;
>
>
> Shouldn't this check also for drive->vdma?
>
Yes, it should. Thanks for pointing that out. One question, though: FUA
support should be changed if the using_dma/mult_count settings are changed.
As using_dma configuration is handled by the IDE midlayer, we might need
to add a callback there. What do you think?
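For illustration, the setup-time check with the drive->vdma test folded in
might look roughly like this (all names come from the quoted idedisk_setup()
hunk; this is a sketch of the suggestion, not the updated patch):

	fua = barrier && idedisk_supports_lba48(id) && ide_id_has_fua(id);
	/*
	 * FUA also needs multisector PIO whenever the PIO-style command
	 * path can be taken: no DMA, no lba48 DMA, or vdma (which issues
	 * PIO commands), as there is no single-sector FUA PIO command.
	 */
	if ((!drive->using_dma || drive->hwif->no_lba48_dma || drive->vdma) &&
	    drive->mult_count == 0)
		fua = 0;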
--
tejun
Hi,
On 11/18/05, Tejun Heo <[email protected]> wrote:
> Hello, Bartlomiej.
>
> Bartlomiej Zolnierkiewicz wrote:
> > On 11/17/05, Tejun Heo <[email protected]> wrote:
> >
> > I fail to see how the partial completions (good + bad sectors)
> > are done in your new scheme, please explain.
> >
>
> It doesn't. I've noted this way back when I posted this patchset the
> second time.
This should be noted in the patch description not in the announcement.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=111795127124020&w=2
>
> Rationales
>
> * The actual barrier IO request is issued as a part of ordered sequence.
> When any part of this sequence fails (any of leading flush, barrier IO
> or post flush), the whole sequence should be considered to have failed.
>
> e.g. if leading flush fails, there's no point in reporting partial or
> full success of barrier IO. Ditto for tailing flush. We can special
> case when only part of barrier IO fails and report partial barrier
> success, but 1. benefits are doubtful 2. even if it's implemented, it
> wouldn't work (see next rationale)
>
> * Barrier requests are not mergeable. ie. Each barrier bio is turned
> into one barrier request and partially completing the request doesn't
> result in any successfully completed bio.
However, your flush request can fail on a sector _completely_
unrelated to the barrier request, so in this case the old code worked
differently. Anyway, I'm fine with this change (the previous logic was
too complicated).
> * SCSI doesn't handle partial completion of barrier IOs.
>
> >
> >>-
> >>-static int idedisk_prepare_flush(request_queue_t *q, struct request *rq)
> >>-{
> >>- ide_drive_t *drive = q->queuedata;
> >>-
> >>- if (!drive->wcache)
> >>- return 0;
> >
> >
> > What does happen if somebody disables drive->wcache later?
> >
>
> Thanks for pointing out. I've moved ordered configuration into
> write_cache such that ordered is reconfigured when write_cache changes.
> There can be in-flight barrier requests which are inconsistent with the
> newly updated setting, but 1. it's not too unfair to assume that user is
> responsible for that synchronization 2. the original implementation had
> the same issue 3. the consequence is not catastrophic.
The consequence could be an increased number of bug reports about
failed IDE commands, which wasn't the case with the !drive->wcache check
in place - please leave it as it was.
> >> static void ide_cacheflush_p(ide_drive_t *drive)
> >>@@ -1034,6 +993,8 @@ static int ide_disk_remove(struct device
> >> struct ide_disk_obj *idkp = drive->driver_data;
> >> struct gendisk *g = idkp->disk;
> >>
> >>+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE, NULL, 0);
> >>+
> >
> >
> > Shouldn't this be done in ide_disk_release()?
>
> Hmmm... The thing is that, AFAIK, requests are not supposed to be issued
> after ->remove is called (->remove is called only on module unload
> unless hardware is hot-unplugged and HL driver cannot be unloaded while
> it's still opened). I think that's why both sd and ide-disk issue the
> last cache flush in ->remove callbacks but not in ->release.
Are you sure? I think that only calling del_gendisk() assures you
that there won't be outstanding fs requests?
I have also noticed a bug in ide_disk_remove() - ide_cacheflush_p()
should be called after del_gendisk() - I will fix it later.
BTW, nowadays you can dynamically detach/attach a driver from/to a
device using the sysfs interface.
Thanks,
Bartlomiej
On 11/18/05, Tejun Heo <[email protected]> wrote:
> Hi, Bartlomiej.
>
> Bartlomiej Zolnierkiewicz wrote:
> > On 11/17/05, Tejun Heo <[email protected]> wrote:
> >
> > What does happen for fua && drive->vdma case?
> >
>
> Thanks for pointing out, wasn't thinking about that. Hmmm... When using
> vdma, single-sector PIO commands are issued instead but there's no
> single-sector FUA PIO command. Would issuing
> ATA_CMD_WRITE_MULTI_FUA_EXT instead of ATA_CMD_WRITE_FUA_EXT work? Or
> should I just disable FUA on vdma case?
Probably it should work fine given that drive->mult_count is on.
The only controller using drive->vdma in the current tree is cs5520
so you should confirm this with Mark Lord & Alan Cox.
> >
> >> } else {
> >> command = lba48 ? WIN_READDMA_EXT : WIN_READDMA;
> >> if (drive->vdma)
> >>@@ -284,8 +298,20 @@ static ide_startstop_t __ide_do_rw_disk(
> >> } else {
> >> if (drive->mult_count) {
> >> hwif->data_phase = TASKFILE_MULTI_OUT;
> >>- command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
> >>+ if (!fua)
> >>+ command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
> >>+ else
> >>+ command = ATA_CMD_WRITE_MULTI_FUA_EXT;
> >> } else {
> >>+ if (unlikely(fua)) {
> >>+ /*
> >>+ * This happens if multisector PIO is
> >>+ * turned off during operation.
> >>+ */
> >>+ printk(KERN_ERR "%s: FUA write but in single "
> >>+ "sector PIO mode\n", drive->name);
> >>+ goto fail;
> >>+ }
> >
> >
> > Wouldn't it be better to do the following check at the beginning
> > of __ide_do_rw_disk() (after checking for dma vs lba48):
> >
> > if (fua) {
> > if (!lba48 || ((!dma || drive->vdma) && !drive->mult_count))
> > goto fail_fua;
> > }
> >
> > ...
> >
> > and fail the request if needed *before* actually touching any
> > hardware registers?
> >
> > fail_fua:
> > printk(KERN_ERR "%s: FUA write unsupported (lba48=%u dma=%u"
> > " vdma=%u mult_count=%u)\n", drive->name,
> > lba48, dma, drive->vdma,
> > drive->mult_count);
> > ide_end_request(drive, 0, 0);
> > return ide_stopped;
> >
>
> Hmmm... The thing is that those failure cases will happen extremely
> rarely if at all. Remember this post?
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=111798102108338&w=3
>
> It's mostly guaranteed that those failure cases don't occur, so I
> thought avoiding IO on failure case wasn't that helpful.
There are two problems with this approach:
* configuration can change dynamically
* locking for configuration changes is flaky
so it can happen that a FUA request will slip into __ide_do_rw_disk().
The best way to deal with such a case is to kill it early, without
touching the I/O registers and/or DMA engine and causing big havoc.
> >>@@ -967,10 +997,19 @@ static void idedisk_setup (ide_drive_t *
> >> barrier = 0;
> >> }
> >>
> >>- printk(KERN_INFO "%s: cache flushes %ssupported\n",
> >>- drive->name, barrier ? "" : "not ");
> >>+ fua = barrier && idedisk_supports_lba48(id) && ide_id_has_fua(id);
> >>+ /* When using PIO, FUA needs multisector. */
> >>+ if ((!drive->using_dma || drive->hwif->no_lba48_dma) &&
> >>+ drive->mult_count == 0)
> >>+ fua = 0;
> >
> >
> > Shouldn't this check also for drive->vdma?
> >
>
> Yes, it does. Thanks for pointing out. One question though. FUA
> support should be changed if using_dma/mult_count settings are changed.
> As using_dma configuration is handled by IDE midlayer, we might need
> to add a callback there. What do you think?
It seems it is needed nowadays, as there are multiple I/O barrier methods.
Maybe the other alternative is to add a ->rq_select_barrier() hook to the
block layer and, for each request, check what kind of barrier should be
issued (it still won't help with the flaky configuration locking, but you
won't have to add any callbacks, etc.).
Long-term we should see if it is possible to remove dynamic IDE
drive configuration and always just use the best possible settings.
Thanks,
Bartlomiej
On Gwe, 2005-11-18 at 17:17 +0100, Bartlomiej Zolnierkiewicz wrote:
> Probably it should work fine given that drive->mult_count is on.
>
> The only controller using drive->vdma in the current tree is cs5520
> so you should confirm this with Mark Lord & Alan Cox.
The CS5520 VDMA performs PIO commands with controller driven DMA to PIO
of the data blocks. Thus it can do any PIO command with one data in or
out phase as if it were DMA.
> Long-term we should see if it is possible to remove dynamic IDE
> drive configuration and always just use the best possible settings.
Hotplug rather prevents that. For the lifetime of a drive connection
most of the settings ought to be settable once. Speeds are the obvious
exception.
On 11/18/05, Alan Cox <[email protected]> wrote:
> On Gwe, 2005-11-18 at 17:17 +0100, Bartlomiej Zolnierkiewicz wrote:
> > Probably it should work fine given that drive->mult_count is on.
> >
> > The only controller using drive->vdma in the current tree is cs5520
> > so you should confirm this with Mark Lord & Alan Cox.
>
> The CS5520 VDMA performs PIO commands with controller driven DMA to PIO
> of the data blocks. Thus it can do any PIO command with one data in or
> out phase as if it were DMA.
Therefore doing ATA_CMD_WRITE_MULTI_FUA_EXT w/ VDMA
should be just fine as is WIN_WRITE_EXT w/ VDMA currently?
> > Long-term we should see if it is possible to remove dynamic IDE
> > drive configuration and always just use the best possible settings.
>
> Hotplug rather prevents that. For the lifetime of a drive connection
> most of the settings ought to be settable once. Speeds are the obvious
> exception.
I'm talking only about the lifetime of a drive connection here,
so I think that we fully agree on this one.
Thanks,
Bartlomiej
On Gwe, 2005-11-18 at 17:38 +0100, Bartlomiej Zolnierkiewicz wrote:
> On 11/18/05, Alan Cox <[email protected]> wrote:
> > On Gwe, 2005-11-18 at 17:17 +0100, Bartlomiej Zolnierkiewicz wrote:
> > > Probably it should work fine given that drive->mult_count is on.
> > >
> > > The only controller using drive->vdma in the current tree is cs5520
> > > so you should confirm this with Mark Lord & Alan Cox.
> >
> > The CS5520 VDMA performs PIO commands with controller driven DMA to PIO
> > of the data blocks. Thus it can do any PIO command with one data in or
> > out phase as if it were DMA.
>
> Therefore doing ATA_CMD_WRITE_MULTI_FUA_EXT w/ VDMA
> should be just fine as is WIN_WRITE_EXT w/ VDMA currently?
As I understand it yes
On Fri, Nov 18, 2005 at 04:59:28PM +0100, Bartlomiej Zolnierkiewicz wrote:
> Hi,
>
> On 11/18/05, Tejun Heo <[email protected]> wrote:
> > Hello, Bartlomiej.
> >
> > Bartlomiej Zolnierkiewicz wrote:
> > > On 11/17/05, Tejun Heo <[email protected]> wrote:
> > >
> > > I fail to see how the partial completions (good + bad sectors)
> > > are done in your new scheme, please explain.
> > >
> >
> > It doesn't. I've noted this way back when I posted this patchset the
> > second time.
>
> This should be noted in the patch description not in the announcement.
Will add it to patch description.
>
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=111795127124020&w=2
> >
> > Rationales
> >
> > * The actual barrier IO request is issued as a part of ordered sequence.
> > When any part of this sequence fails (any of leading flush, barrier IO
> > or post flush), the whole sequence should be considered to have failed.
> >
> > e.g. if leading flush fails, there's no point in reporting partial or
> > full success of barrier IO. Ditto for tailing flush. We can special
> > case when only part of barrier IO fails and report partial barrier
> > success, but 1. benefits are doubtful 2. even if it's implemented, it
> > wouldn't work (see next rationale)
> >
> > * Barrier requests are not mergeable. ie. Each barrier bio is turned
> > into one barrier request and partially completing the request doesn't
> > result in any successfully completed bio.
>
> However your flush request can fail on the sector _completely_
> unrelated to the barrier request so in this case old code worked
> differently. Anyway I'm fine with this change (previous logic was
> too complicated).
Hmmm... The ordered sequence should fail even when the flush request fails
on a sector completely unrelated to the barrier request. The barrier request
must make sure that all requests issued prior to it have made it to the
physical medium successfully.
Anyway, you're right in that it acts differently from the original
code and it should be noted in the patch description. I'll update the
patch description.
>
> > * SCSI doesn't handle partial completion of barrier IOs.
> >
> > >
> > >>-
> > >>-static int idedisk_prepare_flush(request_queue_t *q, struct request *rq)
> > >>-{
> > >>- ide_drive_t *drive = q->queuedata;
> > >>-
> > >>- if (!drive->wcache)
> > >>- return 0;
> > >
> > >
> > > What does happen if somebody disables drive->wcache later?
> > >
> >
> > Thanks for pointing out. I've moved ordered configuration into
> > write_cache such that ordered is reconfigured when write_cache changes.
> > There can be in-flight barrier requests which are inconsistent with the
> > newly updated setting, but 1. it's not too unfair to assume that user is
> > responsible for that synchronization 2. the original implementation had
> > the same issue 3. the consequence is not catastrophic.
>
> The consequence could be increased number of bugreports about
> failed IDE commands which wasn't the case with !drive->wcache check
> in place - please leave as it was.
Ordered requests are processed in the following order.
1. barrier bio reaches blk queue
2. barrier req queued in order
3. when barrier req reaches the head of the request queue, it gets
interpreted into preflush-barrier-postflush requests sequence
and queued. ->prepare_flush_fn is called in this step.
4. When all three requests complete, the ordered sequence ends.
Adding the !drive->wcache test to idedisk_prepare_flush, which in turn
requires adding ->prepare_flush_fn error handling to the blk ordered
handling, prevents flushes for barrier requests between step #1 and
step #3. We can still have flush requests between #3 and #4 after
wcache is turned off.
Please also note that any of the above happens only if a user turns off
the ->wcache setting while a fs is actively performing a barrier.
I'm not sure the benefit justifies added complexity. Do you still
think adding ->wcache test is necessary?
>
> > >> static void ide_cacheflush_p(ide_drive_t *drive)
> > >>@@ -1034,6 +993,8 @@ static int ide_disk_remove(struct device
> > >> struct ide_disk_obj *idkp = drive->driver_data;
> > >> struct gendisk *g = idkp->disk;
> > >>
> > >>+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE, NULL, 0);
> > >>+
> > >
> > >
> > > Shouldn't this be done in ide_disk_release()?
> >
> > Hmmm... The thing is that, AFAIK, requests are not supposed to be issued
> > after ->remove is called (->remove is called only on module unload
> > unless hardware is hot-unplugged and HL driver cannot be unloaded while
> > it's still opened). I think that's why both sd and ide-disk issue the
> > last cache flush in ->remove callbacks but not in ->release.
>
> Are you sure? I think that only calling del_gendisk() assures you
> that there won't be outstanding fs requests?
>
> I have also noticed bug in ide_disk_remove() - ide_cacheflush_p()
> should be called after del_gendisk() - I will fix it later.
>
> BTW Nowadays you can dynamically dettach/attach driver from/to
> device using sysfs interface.
I agree that it should go into ->release, but I am still a bit scared
about issuing commands in ->release (it might access some data
structure which might be gone by then). Also, the correct order seems
to be 'turning off ordered' and then 'performing the last cache flush'.
So, how about adding blk_queue_ordered right above the last
ide_cacheflush_p() now and moving both to ->release in a separate patch
for both IDE and SCSI?
Thanks.
--
tejun
On 11/22/05, Tejun Heo <[email protected]> wrote:
> On Fri, Nov 18, 2005 at 04:59:28PM +0100, Bartlomiej Zolnierkiewicz wrote:
> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=111795127124020&w=2
> > >
> > > Rationales
> > >
> > > * The actual barrier IO request is issued as a part of ordered sequence.
> > > When any part of this sequence fails (any of leading flush, barrier IO
> > > or post flush), the whole sequence should be considered to have failed.
> > >
> > > e.g. if leading flush fails, there's no point in reporting partial or
> > > full success of barrier IO. Ditto for tailing flush. We can special
> > > case when only part of barrier IO fails and report partial barrier
> > > success, but 1. benefits are doubtful 2. even if it's implemented, it
> > > wouldn't work (see next rationale)
> > >
> > > * Barrier requests are not mergeable. ie. Each barrier bio is turned
> > > into one barrier request and partially completing the request doesn't
> > > result in any successfully completed bio.
> >
> > However your flush request can fail on the sector _completely_
> > unrelated to the barrier request so in this case old code worked
> > differently. Anyway I'm fine with this change (previous logic was
> > too complicated).
>
> Hmmm... Ordered sequence should fail even when flush request fails on
> a completely unrelated to the barrier request. The barrier request
> must make sure that all requests issued prior to it have made to the
> physical medium successfully.
>
> Anyways, you're right in that it acts differently from the original
> code and it should be noted in the patch description. I'll update the
> patch description.
Thanks.
> > > * SCSI doesn't handle partial completion of barrier IOs.
> > >
> > > >
> > > >>-
> > > >>-static int idedisk_prepare_flush(request_queue_t *q, struct request *rq)
> > > >>-{
> > > >>- ide_drive_t *drive = q->queuedata;
> > > >>-
> > > >>- if (!drive->wcache)
> > > >>- return 0;
> > > >
> > > >
> > > > What does happen if somebody disables drive->wcache later?
> > > >
> > >
> > > Thanks for pointing out. I've moved ordered configuration into
> > > write_cache such that ordered is reconfigured when write_cache changes.
> > > There can be in-flight barrier requests which are inconsistent with the
> > > newly updated setting, but 1. it's not too unfair to assume that user is
> > > responsible for that synchronization 2. the original implementation had
> > > the same issue 3. the consequence is not catastrophic.
> >
> > The consequence could be increased number of bugreports about
> > failed IDE commands which wasn't the case with !drive->wcache check
> > in place - please leave as it was.
>
> Ordered requests are processed in the following order.
>
> 1. barrier bio reaches blk queue
>
> 2. barrier req queued in order
>
> 3. when barrier req reaches the head of the request queue, it gets
> interpreted into preflush-barrier-postflush requests sequence
> and queued. ->prepare_flush_fn is called in this step.
>
> 4. When all three requests complete, the ordered sequence ends.
>
> Adding !drive->wcache test to idedisk_prepare_flush, which in turn
> requires adding ->prepare_flush_fn error handling to blk ordered
> handling, prevents flushes for barrier requests between step#1 and
Why can't the flush be considered successful for !drive->wcache,
like it was before these changes...
> step#3. We can still have flush reqeuests between #3 and #4 after
> wcache is turned off.
ditto
> Please also note that any of above happens only if a user turns off
> ->wcache setting while a fs is actively performing a barrier.
>
> I'm not sure the benefit justifies added complexity. Do you still
> think adding ->wcache test is necessary?
>
> >
> > > >> static void ide_cacheflush_p(ide_drive_t *drive)
> > > >>@@ -1034,6 +993,8 @@ static int ide_disk_remove(struct device
> > > >> struct ide_disk_obj *idkp = drive->driver_data;
> > > >> struct gendisk *g = idkp->disk;
> > > >>
> > > >>+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE, NULL, 0);
> > > >>+
> > > >
> > > >
> > > > Shouldn't this be done in ide_disk_release()?
> > >
> > > Hmmm... The thing is that, AFAIK, requests are not supposed to be issued
> > > after ->remove is called (->remove is called only on module unload
> > > unless hardware is hot-unplugged and HL driver cannot be unloaded while
> > > it's still opened). I think that's why both sd and ide-disk issue the
> > > last cache flush in ->remove callbacks but not in ->release.
> >
> > Are you sure? I think that only calling del_gendisk() assures you
> > that there won't be outstanding fs requests?
> >
> > I have also noticed bug in ide_disk_remove() - ide_cacheflush_p()
> > should be called after del_gendisk() - I will fix it later.
> >
> > BTW Nowadays you can dynamically dettach/attach driver from/to
> > device using sysfs interface.
>
> I agree that it should go into ->release, but I am still a bit scared
> about issuing commands in ->release (it might access some data
> structure which might be gone by then). Also, the correct order seems
> to be 'turning off ordered' and then 'perform the last cache flush'.
> So, how about adding blk_queue_ordered right above the last
> ide_cacheflush_p() now and move both to ->release in a separate patch
> for both IDE and SCSI?
Not needed, when ide-disk is fixed to call del_gendisk() after
ide_cacheflush_p(), we can add blk_queue_orderer() before
the latter and then everything should be OK.
Thanks,
Bartlomiej
thinko
On 11/22/05, Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > > > >> static void ide_cacheflush_p(ide_drive_t *drive)
> > > > >>@@ -1034,6 +993,8 @@ static int ide_disk_remove(struct device
> > > > >> struct ide_disk_obj *idkp = drive->driver_data;
> > > > >> struct gendisk *g = idkp->disk;
> > > > >>
> > > > >>+ blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE, NULL, 0);
> > > > >>+
> > > > >
> > > > >
> > > > > Shouldn't this be done in ide_disk_release()?
> > > >
> > > > Hmmm... The thing is that, AFAIK, requests are not supposed to be issued
> > > > after ->remove is called (->remove is called only on module unload
> > > > unless hardware is hot-unplugged and HL driver cannot be unloaded while
> > > > it's still opened). I think that's why both sd and ide-disk issue the
> > > > last cache flush in ->remove callbacks but not in ->release.
> > >
> > > Are you sure? I think that only calling del_gendisk() assures you
> > > that there won't be outstanding fs requests?
> > >
> > > I have also noticed bug in ide_disk_remove() - ide_cacheflush_p()
> > > should be called after del_gendisk() - I will fix it later.
> > >
> > > BTW Nowadays you can dynamically dettach/attach driver from/to
> > > device using sysfs interface.
> >
> > I agree that it should go into ->release, but I am still a bit scared
> > about issuing commands in ->release (it might access some data
> > structure which might be gone by then). Also, the correct order seems
> > to be 'turning off ordered' and then 'perform the last cache flush'.
> > So, how about adding blk_queue_ordered right above the last
> > ide_cacheflush_p() now and move both to ->release in a separate patch
> > for both IDE and SCSI?
>
> Not needed, when ide-disk is fixed to call del_gendisk() after
> ide_cacheflush_p(), we can add blk_queue_orderer() before
ide_cacheflush_p() after del_gendisk()
> the latter and then everything should be OK.
blk_queue_ordered() before ide_cacheflush_p()
:)
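Putting the pieces together, the teardown order being agreed on here is,
roughly, the following (a sketch of the relevant fragment of
ide_disk_remove() once the del_gendisk() reordering fix is in; only the
calls shown come from the discussion, the surrounding declarations are
elided):

	/* in ide_disk_remove(), after the del_gendisk() reordering fix */
	del_gendisk(g);

	/* stop generating ordered sequences before the final flush */
	blk_queue_ordered(drive->queue, QUEUE_ORDERED_NONE, NULL, 0);

	/* the last cache flush for this drive */
	ide_cacheflush_p(drive);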
On Tue, Nov 22, 2005 at 09:36:09AM +0100, Bartlomiej Zolnierkiewicz wrote:
[--snip--]
> >
> > Ordered requests are processed in the following order.
> >
> > 1. barrier bio reaches blk queue
> >
> > 2. barrier req queued in order
> >
> > 3. when barrier req reaches the head of the request queue, it gets
> > interpreted into preflush-barrier-postflush requests sequence
> > and queued. ->prepare_flush_fn is called in this step.
> >
> > 4. When all three requests complete, the ordered sequence ends.
> >
> > Adding !drive->wcache test to idedisk_prepare_flush, which in turn
> > requires adding ->prepare_flush_fn error handling to blk ordered
> > handling, prevents flushes for barrier requests between step#1 and
>
> Why for !drive->wcache flush can't be consider as successful
> like it was before these changes...
>
> > step#3. We can still have flush reqeuests between #3 and #4 after
> > wcache is turned off.
>
> ditto
>
I think we have two alternatives here - both have some problems.
1. make ->prepare_flush_fn return some code to tell the blk layer to skip
the flush, as the original code did (rough sketch below).
This is what you're proposing, I guess. The reason why I'm reluctant
to take this approach is that there still remains a window of error
between #3 and #4. The flush requests could already be prepared and
in the queue when ->wcache is turned off. AFAICS, the original code
had the same problem, although the window was narrower.
2. complete flush commands successfully in execute_drive_cmd() if wcache
is not enabled.
This approach fixes all cases but the implementation would be a bit
hackish. execute_drive_cmd() will have to look at the command and
ide_disk->wcache to determine if a special command should be completed
successfully without actually executing it. Maybe we can add a
per-HL-driver callback for that.
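A rough sketch of alternative 1, assuming ->prepare_flush_fn keeps an int
return value where 0 means "skip the flush" (the old convention); the
command setup is reconstructed from the hunks quoted earlier in the thread
and simplified (the FLUSH CACHE vs FLUSH CACHE EXT selection is abbreviated):

	static int idedisk_prepare_flush(request_queue_t *q, struct request *rq)
	{
		ide_drive_t *drive = q->queuedata;

		/* write cache off: nothing to flush, tell blk to skip it */
		if (!drive->wcache)
			return 0;

		memset(rq->cmd, 0, sizeof(rq->cmd));
		if (ide_id_has_flush_cache_ext(drive->id))
			rq->cmd[0] = WIN_FLUSH_CACHE_EXT;
		else
			rq->cmd[0] = WIN_FLUSH_CACHE;
		rq->flags |= REQ_DRIVE_TASK;
		rq->buffer = rq->cmd;
		return 1;
	}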
Bartlomiej, I'm not really sure about both of the above approaches. My
humble opinion is that the benefits brought by both of the above aren't
worth the added complexity. The worst that can happen is a few IDE command
failures caused by executing a flush command on a wcache-disabled drive.
And that would happen only when the user turns off wcache while a
barrier sequence is *in progress*.
Hmmm... What do you think? It's your call. I'll do what you choose.
[--snip--]
> > I agree that it should go into ->release, but I am still a bit scared
> > about issuing commands in ->release (it might access some data
> > structure which might be gone by then). Also, the correct order seems
> > to be 'turning off ordered' and then 'perform the last cache flush'.
> > So, how about adding blk_queue_ordered right above the last
> > ide_cacheflush_p() now and move both to ->release in a separate patch
> > for both IDE and SCSI?
>
> Not needed, when ide-disk is fixed to call del_gendisk() after
> ide_cacheflush_p(), we can add blk_queue_orderer() before
> the latter and then everything should be OK.
Can you get the patch into the post-2.6.15 branch of linux-2.6-block
tree?
Thanks. :-)
--
tejun
On 11/23/05, Tejun Heo <[email protected]> wrote:
> On Tue, Nov 22, 2005 at 09:36:09AM +0100, Bartlomiej Zolnierkiewicz wrote:
> [--snip--]
> > >
> > > Ordered requests are processed in the following order.
> > >
> > > 1. barrier bio reaches blk queue
> > >
> > > 2. barrier req queued in order
> > >
> > > 3. when barrier req reaches the head of the request queue, it gets
> > > interpreted into preflush-barrier-postflush requests sequence
> > > and queued. ->prepare_flush_fn is called in this step.
> > >
> > > 4. When all three requests complete, the ordered sequence ends.
> > >
> > > Adding !drive->wcache test to idedisk_prepare_flush, which in turn
> > > requires adding ->prepare_flush_fn error handling to blk ordered
> > > handling, prevents flushes for barrier requests between step#1 and
> >
> > Why for !drive->wcache flush can't be consider as successful
> > like it was before these changes...
> >
> > > step#3. We can still have flush reqeuests between #3 and #4 after
> > > wcache is turned off.
> >
> > ditto
> >
>
> I think we have two alternatives here - both have some problems.
>
> 1. make ->prepare_flush_fn return some code to tell blk layer skip
> the flush as the original code did.
>
> This is what you're proposing, I guess. The reason why I'm reluctant
> to take this approach is that there still remains window of error
> between #3 and #4. The flush requests could already be prepared and
> in the queue when ->wcache is turned off. AFAICS, the original code
> had the same problem, although the window was narrower.
>
> 2. complete flush commands successfully in execute_drive_cmd() if wcache
> is not enabled.
>
> This approach fixes all cases but the implementation would be a bit
> hackish. execute_drive_cmd() will have to look at the command and
> ide_disk->wcache to determine if a special command should be completed
> successfully without actually executing it. Maybe we can add a
> per-HL-driver callback for that.
>
> Bartlomiej, I'm not really sure about both of above approaches. My
> humble opinion is that benefits brought by both of above aren't worth
> the added complexity. The worst can happen is a few IDE command
> failures caused by executing flush command on a wcache disabled drive.
> And that would happen only when the user turns off wcache while
> barrier sequence is *in progress*.
>
> Hmmm... What do you think? It's your call. I'll do what you choose.
Hmm... both solutions suck. On second thought I agree with you
w.r.t. the original changes (just remember to document them in the patch
description).
Bartlomiej
On Wed, Nov 23 2005, Bartlomiej Zolnierkiewicz wrote:
> On 11/23/05, Tejun Heo <[email protected]> wrote:
> > On Tue, Nov 22, 2005 at 09:36:09AM +0100, Bartlomiej Zolnierkiewicz wrote:
> > [--snip--]
> > > >
> > > > Ordered requests are processed in the following order.
> > > >
> > > > 1. barrier bio reaches blk queue
> > > >
> > > > 2. barrier req queued in order
> > > >
> > > > 3. when barrier req reaches the head of the request queue, it gets
> > > > interpreted into preflush-barrier-postflush requests sequence
> > > > and queued. ->prepare_flush_fn is called in this step.
> > > >
> > > > 4. When all three requests complete, the ordered sequence ends.
> > > >
> > > > Adding !drive->wcache test to idedisk_prepare_flush, which in turn
> > > > requires adding ->prepare_flush_fn error handling to blk ordered
> > > > handling, prevents flushes for barrier requests between step#1 and
> > >
> > > Why for !drive->wcache flush can't be consider as successful
> > > like it was before these changes...
> > >
> > > > step#3. We can still have flush reqeuests between #3 and #4 after
> > > > wcache is turned off.
> > >
> > > ditto
> > >
> >
> > I think we have two alternatives here - both have some problems.
> >
> > 1. make ->prepare_flush_fn return some code to tell blk layer skip
> > the flush as the original code did.
> >
> > This is what you're proposing, I guess. The reason why I'm reluctant
> > to take this approach is that there still remains window of error
> > between #3 and #4. The flush requests could already be prepared and
> > in the queue when ->wcache is turned off. AFAICS, the original code
> > had the same problem, although the window was narrower.
> >
> > 2. complete flush commands successfully in execute_drive_cmd() if wcache
> > is not enabled.
> >
> > This approach fixes all cases but the implementation would be a bit
> > hackish. execute_drive_cmd() will have to look at the command and
> > ide_disk->wcache to determine if a special command should be completed
> > successfully without actually executing it. Maybe we can add a
> > per-HL-driver callback for that.
> >
> > Bartlomiej, I'm not really sure about both of above approaches. My
> > humble opinion is that benefits brought by both of above aren't worth
> > the added complexity. The worst can happen is a few IDE command
> > failures caused by executing flush command on a wcache disabled drive.
> > And that would happen only when the user turns off wcache while
> > barrier sequence is *in progress*.
> >
> > Hmmm... What do you think? It's your call. I'll do what you choose.
>
> Hmm... both solutions sucks. After second thought I agree with you
> w.r.t to original changes (just remember to document them in the patch
> description).
Me too. Plus on most drives a flush cache command on a drive with write
back caching disabled will succeed, not fail. t13 is about to mandate
that behaviour as well, and honestly I think this is the logical way to
implement it in a drive (the flush just becomes a noop).
So Tejun, where do we stand with this patch series? Any changes over
what you posted last week? I'd like to get this merged in the block
tree, get it in -mm and merged for 2.6.16 if things work out.
--
Jens Axboe
Jens Axboe wrote:
> On Wed, Nov 23 2005, Bartlomiej Zolnierkiewicz wrote:
>
>>On 11/23/05, Tejun Heo <[email protected]> wrote:
>>
>>>On Tue, Nov 22, 2005 at 09:36:09AM +0100, Bartlomiej Zolnierkiewicz wrote:
>>>[--snip--]
>>>
>>>>>Ordered requests are processed in the following order.
>>>>>
>>>>>1. barrier bio reaches blk queue
>>>>>
>>>>>2. barrier req queued in order
>>>>>
>>>>>3. when barrier req reaches the head of the request queue, it gets
>>>>> interpreted into preflush-barrier-postflush requests sequence
>>>>> and queued. ->prepare_flush_fn is called in this step.
>>>>>
>>>>>4. When all three requests complete, the ordered sequence ends.
>>>>>
>>>>>Adding !drive->wcache test to idedisk_prepare_flush, which in turn
>>>>>requires adding ->prepare_flush_fn error handling to blk ordered
>>>>>handling, prevents flushes for barrier requests between step#1 and
>>>>
>>>>Why for !drive->wcache flush can't be consider as successful
>>>>like it was before these changes...
>>>>
>>>>
>>>>>step#3. We can still have flush reqeuests between #3 and #4 after
>>>>>wcache is turned off.
>>>>
>>>>ditto
>>>>
>>>
>>>I think we have two alternatives here - both have some problems.
>>>
>>>1. make ->prepare_flush_fn return some code to tell blk layer skip
>>> the flush as the original code did.
>>>
>>>This is what you're proposing, I guess. The reason why I'm reluctant
>>>to take this approach is that there still remains window of error
>>>between #3 and #4. The flush requests could already be prepared and
>>>in the queue when ->wcache is turned off. AFAICS, the original code
>>>had the same problem, although the window was narrower.
>>>
>>>2. complete flush commands successfully in execute_drive_cmd() if wcache
>>> is not enabled.
>>>
>>>This approach fixes all cases but the implementation would be a bit
>>>hackish. execute_drive_cmd() will have to look at the command and
>>>ide_disk->wcache to determine if a special command should be completed
>>>successfully without actually executing it. Maybe we can add a
>>>per-HL-driver callback for that.
>>>
>>>Bartlomiej, I'm not really sure about both of above approaches. My
>>>humble opinion is that benefits brought by both of above aren't worth
>>>the added complexity. The worst can happen is a few IDE command
>>>failures caused by executing flush command on a wcache disabled drive.
>>>And that would happen only when the user turns off wcache while
>>>barrier sequence is *in progress*.
>>>
>>>Hmmm... What do you think? It's your call. I'll do what you choose.
>>
>>Hmm... both solutions sucks. After second thought I agree with you
>>w.r.t to original changes (just remember to document them in the patch
>>description).
>
Yeap, both suck. Glad to hear that. :-)
>
> Me too. Plus on most drives a flush cache command on a drive with write
> back caching disabled will succeed, not fail. t13 is about to mandate
> that behaviour as well, and honestly I think this is the logical way to
> implement it in a drive (the flush just becomes a noop).
>
> So Tejun, where do we stand with this patch series? Any changes over
> what you posted last week? I'd like to get this merged in the block
> tree, get it in -mm and merged for 2.6.16 if things work out.
>
Hi, Bartlomiej. Hi, Jens.
Currently, there are only two more things to do before getting this
thing into the -mm tree.
1. Adjust this patch (update-ide) after merging Bartlomiej's
del_gendisk() fix patch.
2. Update the ide-fua patch as discussed and get it reviewed by Bartlomiej.
Other than the above two, I think we're ready. I'll post the updated
ide-fua patch later today.
Thanks.
--
tejun
On 11/23/05, Tejun <[email protected]> wrote:
<...>
> Hi, Bartlomiej. Hi, Jens.
>
> Currently, there are only two more things to do before getting this
> thing into the -mm tree.
>
> 1. Adjusting this patch (update-ide) after merging Bartlomiej's
> del_gendisk() fix patch.
This should be in the mainline today/tomorrow as it is obviously correct
and I need to push another IDE update anyway (new HW support).
> 2. Update ide-fua patch as discussed and get it reviewed by Bartlomiej.
>
> Other than above two, I think we're ready. I'll post update ide-fua
> patch later today.
>
> Thanks.
>
> --
> tejun
Hi, Bartlomiej.
Bartlomiej Zolnierkiewicz wrote:
> On 11/18/05, Tejun Heo <[email protected]> wrote:
>
>>Hi, Bartlomiej.
>>
>>Bartlomiej Zolnierkiewicz wrote:
>>
>>>On 11/17/05, Tejun Heo <[email protected]> wrote:
>>>
>>>What does happen for fua && drive->vdma case?
>>>
>>
>>Thanks for pointing out, wasn't thinking about that. Hmmm... When using
>>vdma, single-sector PIO commands are issued instead but there's no
>>single-sector FUA PIO command. Would issuing
>>ATA_CMD_WRITE_MULTI_FUA_EXT instead of ATA_CMD_WRITE_FUA_EXT work? Or
>>should I just disable FUA on vdma case?
>
>
> Probably it should work fine given that drive->mult_count is on.
>
> The only controller using drive->vdma in the current tree is cs5520
> so you should confirm this with Mark Lord & Alan Cox.
>
This is done without too much trouble.
>
>>>> } else {
>>>> command = lba48 ? WIN_READDMA_EXT : WIN_READDMA;
>>>> if (drive->vdma)
>>>>@@ -284,8 +298,20 @@ static ide_startstop_t __ide_do_rw_disk(
>>>> } else {
>>>> if (drive->mult_count) {
>>>> hwif->data_phase = TASKFILE_MULTI_OUT;
>>>>- command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
>>>>+ if (!fua)
>>>>+ command = lba48 ? WIN_MULTWRITE_EXT : WIN_MULTWRITE;
>>>>+ else
>>>>+ command = ATA_CMD_WRITE_MULTI_FUA_EXT;
>>>> } else {
>>>>+ if (unlikely(fua)) {
>>>>+ /*
>>>>+ * This happens if multisector PIO is
>>>>+ * turned off during operation.
>>>>+ */
>>>>+ printk(KERN_ERR "%s: FUA write but in single "
>>>>+ "sector PIO mode\n", drive->name);
>>>>+ goto fail;
>>>>+ }
>>>
>>>
>>>Wouldn't it be better to do the following check at the beginning
>>>of __ide_do_rw_disk() (after checking for dma vs lba48):
>>>
>>> if (fua) {
>>> if (!lba48 || ((!dma || drive->vdma) && !drive->mult_count))
>>> goto fail_fua;
>>> }
>>>
>>>...
>>>
>>>and fail the request if needed *before* actually touching any
>>>hardware registers?
>>>
>>>fail_fua:
>>> printk(KERN_ERR "%s: FUA write unsupported (lba48=%u dma=%u"
>>> " vdma=%u mult_count=%u)\n", drive->name,
>>> lba48, dma, drive->vdma,
>>>drive->mult_count);
>>> ide_end_request(drive, 0, 0);
>>> return ide_stopped;
>>>
>>
>>Hmmm... The thing is that those failure cases will happen extremely
>>rarely if at all. Remember this post?
>>
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=111798102108338&w=3
>>
>>It's mostly guaranteed that those failure cases don't occur, so I
>>thought avoiding IO on failure case wasn't that helpful.
>
>
> There are two problems with this approach:
>
> * configuration can change dynamically
> * locking for configuration changes is flakey
>
> so it can happen that FUA request will slip into __ide_do_rw_disk().
>
> The best way to deal with such case is to kill it early without
> touching I/O registers and/or DMA engine and causing big havoc.
>
I've taken your advice and added the test at the top.
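The early test would sit roughly at the top of __ide_do_rw_disk(), once dma
and lba48 are known (a sketch based on the quoted suggestion; the exact
message and placement in the updated patch may differ):

	if (fua && (!lba48 || ((!dma || drive->vdma) && !drive->mult_count))) {
		printk(KERN_ERR "%s: FUA write unsupported (lba48=%u dma=%u"
		       " vdma=%u mult_count=%u)\n", drive->name,
		       lba48, dma, drive->vdma, drive->mult_count);
		ide_end_request(drive, 0, 0);
		return ide_stopped;
	}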
>
>>>>@@ -967,10 +997,19 @@ static void idedisk_setup (ide_drive_t *
>>>> barrier = 0;
>>>> }
>>>>
>>>>- printk(KERN_INFO "%s: cache flushes %ssupported\n",
>>>>- drive->name, barrier ? "" : "not ");
>>>>+ fua = barrier && idedisk_supports_lba48(id) && ide_id_has_fua(id);
>>>>+ /* When using PIO, FUA needs multisector. */
>>>>+ if ((!drive->using_dma || drive->hwif->no_lba48_dma) &&
>>>>+ drive->mult_count == 0)
>>>>+ fua = 0;
>>>
>>>
>>>Shouldn't this check also for drive->vdma?
>>>
>>
>>Yes, it does. Thanks for pointing out. One question though. FUA
>>support should be changed if using_dma/mult_count settings are changed.
>> As using_dma configuration is handled by IDE midlayer, we might need
>>to add a callback there. What do you think?
>
>
> It seems it is needed nowadays as there are multiple I/O barrier methods.
>
> Maybe the other alternative is to add ->rq_select_barrier() hook to the
> block layer and for each request check what kind of barrier should be
> issued (it still won't help for flakey configuration locking but you won't
> have to add any callbacks etc).
>
> Long-term we should see if it is possible to remove dynamic IDE
> drive configuration and always just use the best possible settings.
Well, this one is quite a pain in the ass.
I'm not very fond of ->rq_select_barrier() approach for the following
reasons.
* That removes possibility of correct synchronization. With
blk_queue_ordered() approach, we can later add
blk_queue_[un]lock_ordered() to achieve correct synchronization if that
becomes necessary, but with ->rq_select_barrier() approach, the
low-level driver ends up having less control over what's gonna happen when.
* Changing ordered mode is not supposed to be a frequent operation and
the blk_queue_ordered() interface makes that explicit.
So, I added an ide_driver_t->protocol_changed() callback which gets called
whenever dma/multimode changes occur. Unfortunately, dma/multimode
changes can be committed with or without context, and with or without
the queue lock. As blk_queue_ordered uses the queue lock for
synchronization, this becomes an issue.
I tried to distinguish places where the changes occur while the queue lock
is held from those where it isn't. Not only was it highly error-prone, it couldn't
be done without modifying/auditing all low-level drivers as some drivers
(cs5520) use the same function which touches dma setting
(cs5520_tune_chipset) from both ->speedproc (called with queuelock) and
->ide_dma_check (called without queuelock).
One alternative I'm thinking of is using a workqueue to call
blk_queue_ordered, such that we don't have to guess whether or not we're
called with queuelock held. Unfortunately, this will give us a small
window where wrong barrier requests can hit the drive.
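As a very rough sketch of that idea (everything here is hypothetical and
only illustrates the deferral; ordered_work would be a new work_struct in
ide_drive_t, set up at init time with the old three-argument
INIT_WORK(&drive->ordered_work, idedisk_reconfig_ordered, drive)):

	#include <linux/workqueue.h>

	/* runs in process context, so taking the queue lock inside
	 * blk_queue_ordered() is always safe here */
	static void idedisk_reconfig_ordered(void *data)
	{
		ide_drive_t *drive = data;

		/* recompute ordered mode / prepare_flush_fn for the current
		 * wcache/fua/dma/mult_count settings, e.g.: */
		blk_queue_ordered(drive->queue, QUEUE_ORDERED_DRAIN_FLUSH,
				  idedisk_prepare_flush, GFP_KERNEL);
	}

	/* hypothetical ->protocol_changed() hook: may be called with or
	 * without the queue lock held, so just defer the real work */
	static void idedisk_protocol_changed(ide_drive_t *drive)
	{
		schedule_work(&drive->ordered_work);
	}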
Bartlomiej, any ideas?
Jens, as this one seems to need some time to settle, I'm gonna post
updated patchset for post-2.6.15 without ide-fua patch, so that the
other stuff can be pushed into -mm. I think we can live without ide-fua
for a while. :-)
Thanks.
--
tejun
Oops, I was delusional again.
Tejun Heo wrote:
>
> Well, this one is quite a pain in the ass.
>
> I'm not very fond of ->rq_select_barrier() approach for the following
> reasons.
>
> * That removes possibility of correct synchronization. With
> blk_queue_ordered() approach, we can later add
> blk_queue_[un]lock_ordered() to achieve correct synchronization if that
> becomes necessary, but with ->rq_select_barrier() approach, the
> low-level driver ends up having less control over what's gonna happen when.
Of course, we can do the same lock/unlock dance with
->rq_select_barrier() approach. I wasn't thinking straight. Forget
this rationale.
>
> * Changing ordered mode is not supposed to be a frequent operation and
> the blk_queue_ordered() interface makes that explicit.
>
> So, I added ide_driver_t->protocol_changed() callback which gets called
> whenever dma/multimode changes occur. Unfortunately, dma/multmode
> changes can be committed with or without context, and with or without
> queue lock. As blk_queue_ordered uses the queue lock for
> synchronization, this becomes issue.
>
> I tried to distinguish places where the changes occur while queue lock
> is held from the other. Not only was it highly error-prone, it couldn't
> be done without modifying/auditing all low-level drivers as some drivers
> (cs5520) use the same function which touches dma setting
> (cs5520_tune_chipset) from both ->speedproc (called with queuelock) and
> ->ide_dma_check (called without queuelock).
>
> One alternative I'm thinking of is using a workqueue to call
> blk_queue_ordered, such that we don't have to guess whether or not we're
> called with queuelock held. Unfortunately, this will give us a small
> window where wrong barrier requests can hit the drive.
One thing I wanna add here is that using ->rq_select_barrier() would
have a similar race window. The race window is just hidden there in the
request queue.
>
> Bartlomiej, any ideas?
>
> Jens, as this one seems to need some time to settle, I'm gonna post
> updated patchset for post-2.6.15 without ide-fua patch, so that the
> other stuff can be pushed into -mm. I think we can live without ide-fua
> for a while. :-)
>
> Thanks.
>
--
tejun
On 11/24/05, Tejun Heo <[email protected]> wrote:
>
> Oops, I was delusional again.
>
> Tejun Heo wrote:
> >
> > Well, this one is quite a pain in the ass.
> >
> > I'm not very fond of ->rq_select_barrier() approach for the following
> > reasons.
> >
> > * That removes possibility of correct synchronization. With
> > blk_queue_ordered() approach, we can later add
> > blk_queue_[un]lock_ordered() to achieve correct synchronization if that
> > becomes necessary, but with ->rq_select_barrier() approach, the
> > low-level driver ends up having less control over what's gonna happen when.
>
> Of course, we can do the same lock/unlock dance with
> ->rq_select_barrier() approach. I wasn't thinking straight. Forget
> this rationale.
>
> >
> > * Changing ordered mode is not supposed to be a frequent operation and
> > the blk_queue_ordered() interface makes that explicit.
> >
> > So, I added ide_driver_t->protocol_changed() callback which gets called
> > whenever dma/multimode changes occur. Unfortunately, dma/multmode
> > changes can be committed with or without context, and with or without
> > queue lock. As blk_queue_ordered uses the queue lock for
> > synchronization, this becomes issue.
> >
> > I tried to distinguish places where the changes occur while queue lock
> > is held from the other. Not only was it highly error-prone, it couldn't
> > be done without modifying/auditing all low-level drivers as some drivers
> > (cs5520) use the same function which touches dma setting
> > (cs5520_tune_chipset) from both ->speedproc (called with queuelock) and
> > ->ide_dma_check (called without queuelock).
> >
> > One alternative I'm thinking of is using a workqueue to call
> > blk_queue_ordered, such that we don't have to guess whether or not we're
> > called with queuelock held. Unfortunately, this will give us a small
> > window where wrong barrier requests can hit the drive.
>
> One thing I wanna add here is that using ->rq_select_barrier() would
> have similar race window. The race windows is just hidden there in the
> request queue.
>
> >
> > Bartlomiej, any ideas?
Wouldn't the whole thing get magically fixed with the conversion of
IDE settings to use REQ_DRIVER and making them pass through the
request queue (an old idea everybody seems to like but nobody wants
to implement :)?
> > Jens, as this one seems to need some time to settle, I'm gonna post
> > updated patchset for post-2.6.15 without ide-fua patch, so that the
> > other stuff can be pushed into -mm. I think we can live without ide-fua
> > for a while. :-)
You might as well post the updated ide-fua patch.
I would like to see it; it may be OK to put it in -mm for now.
Thanks,
Bartlomiej