2003-02-05 15:09:50

by Jens Axboe

Subject: [PATCH] ide write barriers

Hi,

The attached patch implements write barrier operations in the block
layer and for IDE, specifically. The goal is to make the use of write
back cache enabled ide drives safe with journalled file systems.

Patch is against 2.4.21-pre4-bk as of today, and includes a small patch
to enable it on ext3. Chris has a patch for reiserfs as well.

===== drivers/block/elevator.c 1.7 vs edited =====
--- 1.7/drivers/block/elevator.c Mon Sep 16 09:36:31 2002
+++ edited/drivers/block/elevator.c Wed Feb 5 13:44:38 2003
@@ -156,6 +156,12 @@
while ((entry = entry->prev) != head) {
struct request *__rq = blkdev_entry_to_request(entry);

+ /*
+ * we can neither merge nor insert before/with a flush
+ */
+ if (__rq->cmd_flags & RQ_WRITE_ORDERED)
+ break;
+
if (__rq->cmd != rw)
continue;
if (__rq->rq_dev != bh->b_rdev)
===== drivers/block/ll_rw_blk.c 1.42 vs edited =====
--- 1.42/drivers/block/ll_rw_blk.c Mon Dec 16 07:22:15 2002
+++ edited/drivers/block/ll_rw_blk.c Wed Feb 5 13:44:38 2003
@@ -240,6 +240,32 @@
void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
{
q->make_request_fn = mfn;
+ q->ordered = QUEUE_ORDERED_NONE;
+}
+
+/**
+ * blk_queue_ordered - does this queue support ordered writes
+ * @q: the request queue
+ * @flag: see below
+ *
+ * Description:
+ * For journalled file systems, doing ordered writes on a commit
+ * block instead of explicitly doing wait_on_buffer (which is bad
+ * for performance) can be a big win. Block drivers supporting this
+ * feature should call this function and indicate so.
+ *
+ * SCSI drivers usually need to support ordered tags, while others
+ * may have to do a complete drive cache flush if they are using write
+ * back caching (or not and lying about it)
+ *
+ * With this in mind, the values are
+ * QUEUE_ORDERED_NONE: the default, doesn't support barrier
+ * QUEUE_ORDERED_TAG: supports ordered tags
+ * QUEUE_ORDERED_FLUSH: supports barrier through cache flush
+ **/
+void blk_queue_ordered(request_queue_t *q, int flag)
+{
+ q->ordered = flag;
}

/**
@@ -517,6 +543,7 @@
rq = blkdev_free_rq(&rl->free);
list_del(&rq->queue);
rl->count--;
+ rq->cmd_flags = 0;
rq->rq_status = RQ_ACTIVE;
rq->cmd = rw;
rq->special = NULL;
@@ -908,12 +935,27 @@
int rw_ahead, max_sectors, el_ret;
struct list_head *head, *insert_here;
int latency;
+ int write_ordered = 0;
elevator_t *elevator = &q->elevator;

+ /* check for barrier requests the device can't handle */
+ if (buffer_ordered_tag(bh))
+ write_ordered = QUEUE_ORDERED_TAG;
+ else if (buffer_ordered_flush(bh))
+ write_ordered = QUEUE_ORDERED_FLUSH;
+
+ if (write_ordered && q->ordered != write_ordered) {
+ if (buffer_ordered_hard(bh)) {
+ set_bit(BH_IO_OPNOTSUPP, &bh->b_state);
+ goto end_io;
+ }
+ write_ordered = 0;
+ }
+
count = bh->b_size >> 9;
sector = bh->b_rsector;

- rw_ahead = 0; /* normal case; gets changed below for READA */
+ latency = rw_ahead = 0; /* normal case; gets changed below for READA */
switch (rw) {
case READA:
#if 0 /* bread() misinterprets failed READA attempts as IO errors on SMP */
@@ -922,7 +964,8 @@
rw = READ; /* drop into READ */
case READ:
case WRITE:
- latency = elevator_request_latency(elevator, rw);
+ if (!write_ordered)
+ latency = elevator_request_latency(elevator, rw);
break;
default:
BUG();
@@ -1049,6 +1092,9 @@
}

/* fill up the request-info, and add it to the queue */
+ if (write_ordered)
+ req->cmd_flags |= RQ_WRITE_ORDERED;
+
req->elevator_sequence = latency;
req->cmd = rw;
req->errors = 0;
@@ -1519,3 +1565,4 @@
EXPORT_SYMBOL(blk_max_pfn);
EXPORT_SYMBOL(blk_seg_merge_ok);
EXPORT_SYMBOL(blk_nohighio);
+EXPORT_SYMBOL(blk_queue_ordered);
===== drivers/ide/ide-disk.c 1.12 vs edited =====
--- 1.12/drivers/ide/ide-disk.c Mon Sep 30 18:19:50 2002
+++ edited/drivers/ide/ide-disk.c Wed Feb 5 15:57:43 2003
@@ -766,32 +766,7 @@

static int idedisk_end_request (ide_drive_t *drive, int uptodate)
{
- struct request *rq;
- unsigned long flags;
- int ret = 1;
-
- spin_lock_irqsave(&io_request_lock, flags);
- rq = HWGROUP(drive)->rq;
-
- /*
- * decide whether to reenable DMA -- 3 is a random magic for now,
- * if we DMA timeout more than 3 times, just stay in PIO
- */
- if (drive->state == DMA_PIO_RETRY && drive->retry_pio <= 3) {
- drive->state = 0;
- HWGROUP(drive)->hwif->ide_dma_on(drive);
- }
-
- if (!end_that_request_first(rq, uptodate, drive->name)) {
- add_blkdev_randomness(MAJOR(rq->rq_dev));
- blkdev_dequeue_request(rq);
- HWGROUP(drive)->rq = NULL;
- end_that_request_last(rq);
- ret = 0;
- }
-
- spin_unlock_irqrestore(&io_request_lock, flags);
- return ret;
+ return ide_end_request(drive, uptodate);
}

static u8 idedisk_dump_status (ide_drive_t *drive, const char *msg, u8 stat)
===== drivers/ide/ide-io.c 1.2 vs edited =====
--- 1.2/drivers/ide/ide-io.c Thu Jan 30 18:28:59 2003
+++ edited/drivers/ide/ide-io.c Wed Feb 5 16:15:33 2003
@@ -58,8 +58,8 @@

#if (DISK_RECOVERY_TIME > 0)

-Error So the User Has To Fix the Compilation And Stop Hacking Port 0x43
-Does anyone ever use this anyway ??
+#error So the User Has To Fix the Compilation And Stop Hacking Port 0x43
+#error Does anyone ever use this anyway ??

/*
* For really screwy hardware (hey, at least it *can* be used with Linux)
@@ -88,6 +88,38 @@
#endif /* DISK_RECOVERY_TIME */
}

+ /*
+ * preempt pending requests, and store this cache flush for immediate
+ * execution
+ */
+static struct request *ide_queue_flush_cmd(ide_drive_t *drive,
+ struct request *rq, int post)
+{
+ struct request *flush_rq = &HWGROUP(drive)->wrq;
+
+ list_del_init(&rq->queue);
+
+ memset(drive->special_buf, 0, sizeof(drive->special_buf));
+
+ ide_init_drive_cmd(flush_rq);
+
+ flush_rq->buffer = drive->special_buf;
+ flush_rq->special = rq;
+ flush_rq->buffer[0] = WIN_FLUSH_CACHE;
+
+ if (drive->id->cfs_enable_2 & 0x2400)
+ flush_rq->buffer[0] = WIN_FLUSH_CACHE_EXT;
+
+ if (!post) {
+ drive->doing_barrier = 1;
+ flush_rq->cmd_flags |= RQ_WRITE_PREFLUSH;
+ } else
+ flush_rq->cmd_flags |= RQ_WRITE_POSTFLUSH;
+
+ list_add(&flush_rq->queue, &drive->queue.queue_head);
+ return flush_rq;
+}
+
/*
* ide_end_request - complete an IDE I/O
* @drive: IDE device for the I/O
@@ -117,10 +149,20 @@

if (!end_that_request_first(rq, uptodate, drive->name)) {
add_blkdev_randomness(MAJOR(rq->rq_dev));
- blkdev_dequeue_request(rq);
HWGROUP(drive)->rq = NULL;
- end_that_request_last(rq);
ret = 0;
+
+ /*
+ * if this is a write barrier, flush the writecache before
+ * signalling completion of this request.
+ */
+ if (rq->cmd_flags & RQ_WRITE_ORDERED)
+ ide_queue_flush_cmd(drive, rq, 1);
+ else {
+ blkdev_dequeue_request(rq);
+ end_that_request_last(rq);
+ }
+
}

spin_unlock_irqrestore(&io_request_lock, flags);
@@ -220,6 +262,35 @@
}
spin_lock_irqsave(&io_request_lock, flags);
blkdev_dequeue_request(rq);
+
+ /*
+ * if a cache flush fails, disable ordered write support
+ */
+ if (rq->cmd_flags & (RQ_WRITE_PREFLUSH | RQ_WRITE_POSTFLUSH)) {
+ struct request *real_rq = rq->special;
+
+ /*
+ * best-effort currently, this ignores the fact that there
+ * may be other barriers currently queued that we can't
+ * honor any more
+ */
+ if (err) {
+ printk("%s: cache flushing failed. disable write back cacheing for journalled file systems\n", drive->name);
+ blk_queue_ordered(&drive->queue, QUEUE_ORDERED_NONE);
+ }
+
+ if (rq->cmd_flags & RQ_WRITE_POSTFLUSH) {
+ drive->doing_barrier = 0;
+ end_that_request_last(real_rq);
+ } else {
+ /*
+ * just indicate that we did the pre flush
+ */
+ real_rq->cmd_flags |= RQ_WRITE_PREFLUSH;
+ list_add(&real_rq->queue, &drive->queue.queue_head);
+ }
+ }
+
HWGROUP(drive)->rq = NULL;
end_that_request_last(rq);
spin_unlock_irqrestore(&io_request_lock, flags);
@@ -277,8 +348,11 @@
struct request *rq;
u8 err;

+ if (drive == NULL)
+ return ide_stopped;
+
err = ide_dump_status(drive, msg, stat);
- if (drive == NULL || (rq = HWGROUP(drive)->rq) == NULL)
+ if ((rq = HWGROUP(drive)->rq) == NULL)
return ide_stopped;

hwif = HWIF(drive);
@@ -682,6 +756,15 @@
repeat:
best = NULL;
drive = hwgroup->drive;
+
+ /*
+ * drive is doing pre-flush, ordered write, post-flush sequence. even
+ * though that is 3 requests, it must be seen as a single transaction.
+ * we must not preempt this drive until that is complete
+ */
+ if (drive->doing_barrier)
+ return drive;
+
do {
if (!blk_queue_empty(&drive->queue) && (!drive->sleep || time_after_eq(jiffies, drive->sleep))) {
if (!best
@@ -824,7 +907,18 @@
printk(KERN_ERR "%s: Huh? nuking plugged queue\n", drive->name);

rq = blkdev_entry_next_request(&drive->queue.queue_head);
+
+ /*
+ * if rq is a barrier write, issue pre cache flush if not
+ * already done
+ */
+ if (rq->cmd_flags & RQ_WRITE_ORDERED) {
+ if (!(rq->cmd_flags & RQ_WRITE_PREFLUSH))
+ rq = ide_queue_flush_cmd(drive, rq, 0);
+ }
+
hwgroup->rq = rq;
+
/*
* Some systems have trouble with IDE IRQs arriving while
* the driver is still setting things up. So, here we disable
===== drivers/ide/ide-probe.c 1.14 vs edited =====
--- 1.14/drivers/ide/ide-probe.c Mon Jan 27 17:44:05 2003
+++ edited/drivers/ide/ide-probe.c Wed Feb 5 16:17:25 2003
@@ -784,11 +784,8 @@
q->queuedata = HWGROUP(drive);
blk_init_queue(q, do_ide_request);

- if (drive->media == ide_disk) {
-#ifdef CONFIG_BLK_DEV_ELEVATOR_NOOP
- elevator_init(&q->elevator, ELEVATOR_NOOP);
-#endif
- }
+ if (drive->media == ide_disk)
+ blk_queue_ordered(&drive->queue, QUEUE_ORDERED_FLUSH);
}

#undef __IRQ_HELL_SPIN
===== fs/jbd/commit.c 1.5 vs edited =====
--- 1.5/fs/jbd/commit.c Thu Sep 26 14:25:37 2002
+++ edited/fs/jbd/commit.c Wed Feb 5 13:44:38 2003
@@ -598,7 +598,15 @@
struct buffer_head *bh = jh2bh(descriptor);
clear_bit(BH_Dirty, &bh->b_state);
bh->b_end_io = journal_end_buffer_io_sync;
+
+ /* if we're on an ide device, setting BH_Ordered_Flush
+ will force a write cache flush before and after the
+ commit block. Otherwise, it'll do nothing. */
+
+ set_bit(BH_Ordered_Flush, &bh->b_state);
submit_bh(WRITE, bh);
+ clear_bit(BH_Ordered_Flush, &bh->b_state);
+
wait_on_buffer(bh);
put_bh(bh); /* One for getblk() */
journal_unlock_journal_head(descriptor);
===== include/linux/blkdev.h 1.23 vs edited =====
--- 1.23/include/linux/blkdev.h Fri Nov 29 23:03:01 2002
+++ edited/include/linux/blkdev.h Wed Feb 5 13:54:09 2003
@@ -32,6 +32,7 @@

kdev_t rq_dev;
int cmd; /* READ or WRITE */
+ unsigned long cmd_flags;
int errors;
unsigned long start_time;
unsigned long sector;
@@ -48,6 +49,10 @@
request_queue_t *q;
};

+#define RQ_WRITE_ORDERED 1 /* ordered write */
+#define RQ_WRITE_PREFLUSH 2 /* pre-barrier flush */
+#define RQ_WRITE_POSTFLUSH 4 /* post-barrier flush */
+
#include <linux/elevator.h>

typedef int (merge_request_fn) (request_queue_t *q,
@@ -127,6 +132,10 @@
char head_active;

unsigned long bounce_pfn;
+ /*
+ * ordered write support
+ */
+ char ordered;

/*
* Is meant to protect the queue in the future instead of
@@ -156,6 +165,9 @@
}
}

+#define QUEUE_ORDERED_NONE 0 /* no support */
+#define QUEUE_ORDERED_TAG 1 /* supported by tags (fast) */
+#define QUEUE_ORDERED_FLUSH 2 /* supported by cache flush (ugh!) */
extern unsigned long blk_max_low_pfn, blk_max_pfn;

#define BLK_BOUNCE_HIGH (blk_max_low_pfn << PAGE_SHIFT)
@@ -225,6 +237,7 @@
extern void blk_init_queue(request_queue_t *, request_fn_proc *);
extern void blk_cleanup_queue(request_queue_t *);
extern void blk_queue_headactive(request_queue_t *, int);
+extern void blk_queue_ordered(request_queue_t *, int);
extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
extern void generic_unplug_device(void *);
extern inline int blk_seg_merge_ok(struct buffer_head *, struct buffer_head *);
===== include/linux/fs.h 1.74 vs edited =====
--- 1.74/include/linux/fs.h Sat Jan 4 04:09:16 2003
+++ edited/include/linux/fs.h Wed Feb 5 13:53:34 2003
@@ -221,6 +221,10 @@
BH_Launder, /* 1 if we can throttle on this buffer */
BH_Attached, /* 1 if b_inode_buffers is linked into a list */
BH_JBD, /* 1 if it has an attached journal_head */
+ BH_Ordered_Tag, /* 1 if this buffer is a ordered write barrier */
+ BH_Ordered_Flush,/* 1 if this buffer is a flush write barrier */
+ BH_Ordered_Hard, /* 1 if barrier required by the caller */
+ BH_IO_OPNOTSUPP,/* 1 if block layer rejected a barrier write */

BH_PrivateStart,/* not a state bit, but the first bit available
* for private allocation by other entities
@@ -283,7 +287,10 @@
#define buffer_new(bh) __buffer_state(bh,New)
#define buffer_async(bh) __buffer_state(bh,Async)
#define buffer_launder(bh) __buffer_state(bh,Launder)
-
+#define buffer_ordered_tag(bh) __buffer_state(bh,Ordered_Tag)
+#define buffer_ordered_hard(bh) __buffer_state(bh,Ordered_Hard)
+#define buffer_ordered_flush(bh) __buffer_state(bh,Ordered_Flush)
+
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)

extern void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset);
===== include/linux/ide.h 1.9 vs edited =====
--- 1.9/include/linux/ide.h Fri Jan 31 14:22:15 2003
+++ edited/include/linux/ide.h Wed Feb 5 16:14:15 2003
@@ -743,6 +743,7 @@
unsigned remap_0_to_1 : 2; /* 0=remap if ezdrive, 1=remap, 2=noremap */
unsigned ata_flash : 1; /* 1=present, 0=default */
unsigned dead : 1; /* 1=dead, no new attachments */
+ unsigned doing_barrier : 1; /* state, 1=currently doing flush */
unsigned addressing; /* : 3;
* 0=28-bit
* 1=48-bit
@@ -788,6 +789,8 @@
int forced_lun; /* if hdxlun was given at boot */
int lun; /* logical unit */
int crc_count; /* crc counter to reduce drive speed */
+
+ char special_buf[4]; /* IDE_DRIVE_CMD, free use */
} ide_drive_t;

typedef struct ide_pio_ops_s {

--
Jens Axboe


2003-02-05 15:59:51

by Marc-Christian Petersen

Subject: Re: [PATCH] ide write barriers

On Wednesday 05 February 2003 16:18, Jens Axboe wrote:

Hi Jens,

> The attached patch implements write barrier operations in the block
> layer and for IDE, specifically. The goal is to make the use of write
> back cache enabled ide drives safe with journalled file systems.
> Patch is against 2.4.21-pre4-bk as of today, and includes a small patch
> to enable it on ext3. Chris has a patch for reiserfs as well.
Could you also please cook up one for 2.4.20? :) Thank you.

ciao, Marc

2003-02-05 16:24:43

by Jens Axboe

Subject: Re: [PATCH] ide write barriers

On Wed, Feb 05 2003, Marc-Christian Petersen wrote:
> On Wednesday 05 February 2003 16:18, Jens Axboe wrote:
>
> Hi Jens,
>
> > The attached patch implements write barrier operations in the block
> > layer and for IDE, specifically. The goal is to make the use of write
> > back cache enabled ide drives safe with journalled file systems.
> > Patch is against 2.4.21-pre4-bk as of today, and includes a small patch
> > to enable it on ext3. Chris has a patch for reiserfs as well.
> Could you also please cook up one for 2.4.20? :) Thank you.

Sure, I had that one already. BTW, I discovered that the default io
scheduler forgets to honor the cmd_flags; it's supposed to break like
the noop does (see the very first hunk in the very first file). I must
have removed that by mistake some time ago... This applies both to the
2.4.21-pre4 patch posted and to this one.

diff -urN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.4.20/drivers/block/elevator.c linux/drivers/block/elevator.c
--- /opt/kernel/linux-2.4.20/drivers/block/elevator.c 2002-11-29 00:53:12.000000000 +0100
+++ linux/drivers/block/elevator.c 2002-11-19 07:58:11.000000000 +0100
@@ -156,6 +156,12 @@
while ((entry = entry->prev) != head) {
struct request *__rq = blkdev_entry_to_request(entry);

+ /*
+ * we can neither merge nor insert before/with a flush
+ */
+ if (__rq->cmd_flags & RQ_WRITE_ORDERED)
+ break;
+
if (__rq->cmd != rw)
continue;
if (__rq->rq_dev != bh->b_rdev)
diff -urN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.4.20/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- /opt/kernel/linux-2.4.20/drivers/block/ll_rw_blk.c 2002-11-29 00:53:12.000000000 +0100
+++ linux/drivers/block/ll_rw_blk.c 2002-11-22 13:53:31.000000000 +0100
@@ -240,6 +240,32 @@
void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
{
q->make_request_fn = mfn;
+ q->ordered = QUEUE_ORDERED_NONE;
+}
+
+/**
+ * blk_queue_ordered - does this queue support ordered writes
+ * @q: the request queue
+ * @flag: see below
+ *
+ * Description:
+ * For journalled file systems, doing ordered writes on a commit
+ * block instead of explicitly doing wait_on_buffer (which is bad
+ * for performance) can be a big win. Block drivers supporting this
+ * feature should call this function and indicate so.
+ *
+ * SCSI drivers usually need to support ordered tags, while others
+ * may have to do a complete drive cache flush if they are using write
+ * back caching (or not and lying about it)
+ *
+ * With this in mind, the values are
+ * QUEUE_ORDERED_NONE: the default, doesn't support barrier
+ * QUEUE_ORDERED_TAG: supports ordered tags
+ * QUEUE_ORDERED_FLUSH: supports barrier through cache flush
+ **/
+void blk_queue_ordered(request_queue_t *q, int flag)
+{
+ q->ordered = flag;
}

/**
@@ -432,7 +458,7 @@

si_meminfo(&si);
megs = si.totalram >> (20 - PAGE_SHIFT);
- nr_requests = 128;
+ nr_requests = 16;
if (megs < 32)
nr_requests /= 2;
blk_grow_request_list(q, nr_requests);
@@ -517,6 +543,7 @@
rq = blkdev_free_rq(&rl->free);
list_del(&rq->queue);
rl->count--;
+ rq->cmd_flags = 0;
rq->rq_status = RQ_ACTIVE;
rq->cmd = rw;
rq->special = NULL;
@@ -908,12 +935,27 @@
int rw_ahead, max_sectors, el_ret;
struct list_head *head, *insert_here;
int latency;
+ int write_ordered = 0;
elevator_t *elevator = &q->elevator;

+ /* check for barrier requests the device can't handle */
+ if (buffer_ordered_tag(bh))
+ write_ordered = QUEUE_ORDERED_TAG;
+ else if (buffer_ordered_flush(bh))
+ write_ordered = QUEUE_ORDERED_FLUSH;
+
+ if (write_ordered && q->ordered != write_ordered) {
+ if (buffer_ordered_hard(bh)) {
+ set_bit(BH_IO_OPNOTSUPP, &bh->b_state);
+ goto end_io;
+ }
+ write_ordered = 0;
+ }
+
count = bh->b_size >> 9;
sector = bh->b_rsector;

- rw_ahead = 0; /* normal case; gets changed below for READA */
+ latency = rw_ahead = 0; /* normal case; gets changed below for READA */
switch (rw) {
case READA:
#if 0 /* bread() misinterprets failed READA attempts as IO errors on SMP */
@@ -922,7 +964,8 @@
rw = READ; /* drop into READ */
case READ:
case WRITE:
- latency = elevator_request_latency(elevator, rw);
+ if (!write_ordered)
+ latency = elevator_request_latency(elevator, rw);
break;
default:
BUG();
@@ -1049,6 +1092,9 @@
}

/* fill up the request-info, and add it to the queue */
+ if (write_ordered)
+ req->cmd_flags |= RQ_WRITE_ORDERED;
+
req->elevator_sequence = latency;
req->cmd = rw;
req->errors = 0;
@@ -1525,3 +1571,4 @@
EXPORT_SYMBOL(blk_max_pfn);
EXPORT_SYMBOL(blk_seg_merge_ok);
EXPORT_SYMBOL(blk_nohighio);
+EXPORT_SYMBOL(blk_queue_ordered);
diff -urN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.4.20/drivers/ide/ide.c linux/drivers/ide/ide.c
--- /opt/kernel/linux-2.4.20/drivers/ide/ide.c 2002-11-29 00:53:13.000000000 +0100
+++ linux/drivers/ide/ide.c 2002-11-19 07:58:11.000000000 +0100
@@ -555,6 +555,36 @@
}

/*
+ * preempt pending requests, and store this cache flush for immediate
+ * execution
+ */
+static struct request *ide_queue_flush_cmd(ide_drive_t *drive, struct request *rq, int post)
+{
+ struct request *flush_rq = &HWGROUP(drive)->wrq;
+
+ list_del(&rq->queue);
+
+ memset(drive->special_buf, 0, sizeof(drive->special_buf));
+
+ ide_init_drive_cmd(flush_rq);
+
+ flush_rq->buffer = drive->special_buf;
+ flush_rq->special = rq;
+
+ flush_rq->buffer[0] = (drive->id->cfs_enable_2 & 0x2400) ? WIN_FLUSH_CACHE_EXT : WIN_FLUSH_CACHE;
+
+ if (post)
+ flush_rq->cmd_flags |= RQ_WRITE_POSTFLUSH;
+ else {
+ drive->doing_barrier = 1;
+ flush_rq->cmd_flags |= RQ_WRITE_PREFLUSH;
+ }
+
+ list_add(&flush_rq->queue, &drive->queue.queue_head);
+ return flush_rq;
+}
+
+/*
* This is our end_request replacement function.
*/
void ide_end_request (byte uptodate, ide_hwgroup_t *hwgroup)
@@ -577,9 +607,19 @@

if (!end_that_request_first(rq, uptodate, hwgroup->drive->name)) {
add_blkdev_randomness(MAJOR(rq->rq_dev));
- blkdev_dequeue_request(rq);
hwgroup->rq = NULL;
- end_that_request_last(rq);
+
+ /*
+ * if this is a write barrier, flush the writecache before
+ * allowing new requests to finish and before signalling
+ * completion of this request
+ */
+ if (rq->cmd_flags & RQ_WRITE_ORDERED)
+ ide_queue_flush_cmd(drive, rq, 1);
+ else {
+ blkdev_dequeue_request(rq);
+ end_that_request_last(rq);
+ }
}
spin_unlock_irqrestore(&io_request_lock, flags);
}
@@ -932,8 +972,36 @@
default:
break;
}
+
spin_lock_irqsave(&io_request_lock, flags);
blkdev_dequeue_request(rq);
+
+ /*
+ * if a cache flush fails, disable ordered write support
+ */
+ if (rq->cmd_flags & (RQ_WRITE_PREFLUSH | RQ_WRITE_POSTFLUSH)) {
+ struct request *real_rq = rq->special;
+
+ /*
+ * best-effort currently, this ignores the fact that there
+ * may be other barriers currently queued that we can't
+ * honor any more
+ */
+ if (err)
+ blk_queue_ordered(&drive->queue, QUEUE_ORDERED_NONE);
+
+ if (rq->cmd_flags & RQ_WRITE_POSTFLUSH) {
+ drive->doing_barrier = 0;
+ end_that_request_last(real_rq);
+ } else {
+ /*
+ * just indicate that we did the pre flush
+ */
+ real_rq->cmd_flags |= RQ_WRITE_PREFLUSH;
+ list_add(&real_rq->queue, &drive->queue.queue_head);
+ }
+ }
+
HWGROUP(drive)->rq = NULL;
end_that_request_last(rq);
spin_unlock_irqrestore(&io_request_lock, flags);
@@ -947,6 +1015,13 @@
unsigned long flags;
byte err = 0;

+ if (drive->quiet) {
+ if ((stat & (BUSY_STAT|ERR_STAT)) == ERR_STAT)
+ err = GET_ERR();
+
+ return err;
+ }
+
__save_flags (flags); /* local CPU only */
ide__sti(); /* local CPU only */
printk("%s: %s: status=0x%02x", drive->name, msg, stat);
@@ -1049,9 +1124,14 @@
struct request *rq;
byte err;

+ if (drive == NULL)
+ return ide_stopped;
+
err = ide_dump_status(drive, msg, stat);
- if (drive == NULL || (rq = HWGROUP(drive)->rq) == NULL)
+
+ if ((rq = HWGROUP(drive)->rq) == NULL)
return ide_stopped;
+
/* retry only "normal" I/O: */
if (rq->cmd == IDE_DRIVE_CMD || rq->cmd == IDE_DRIVE_TASK) {
rq->errors = 1;
@@ -1454,6 +1534,15 @@
repeat:
best = NULL;
drive = hwgroup->drive;
+
+ /*
+ * drive is doing pre-flush, ordered write, post-flush sequence. even
+ * though that is 3 requests, it must be seen as a single transaction.
+ * we must not preempt this drive until that is complete
+ */
+ if (drive->doing_barrier)
+ return drive;
+
do {
if (!list_empty(&drive->queue.queue_head) && (!drive->sleep || 0 <= (signed long)(jiffies - drive->sleep))) {
if (!best
@@ -1583,7 +1672,18 @@
if ( drive->queue.plugged ) /* paranoia */
printk("%s: Huh? nuking plugged queue\n", drive->name);

- rq = hwgroup->rq = blkdev_entry_next_request(&drive->queue.queue_head);
+ rq = blkdev_entry_next_request(&drive->queue.queue_head);
+
+ /*
+ * if rq is a barrier write, issue pre cache flush if not
+ * already done
+ */
+ if ((rq->cmd_flags & RQ_WRITE_ORDERED)
+ && !(rq->cmd_flags & RQ_WRITE_PREFLUSH))
+ rq = ide_queue_flush_cmd(drive, rq, 0);
+
+ hwgroup->rq = rq;
+
/*
* Some systems have trouble with IDE IRQs arriving while
* the driver is still setting things up. So, here we disable
@@ -3868,6 +3968,14 @@
drive->dsc_overlap = (drive->next != drive && driver->supports_dsc_overlap);
drive->nice1 = 1;
}
+ if (DRIVER(drive)->flushcache && drive->media == ide_disk) {
+ drive->quiet = 1;
+ if (!DRIVER(drive)->flushcache(drive)) {
+ blk_queue_ordered(&drive->queue, QUEUE_ORDERED_FLUSH);
+ printk("%s: safely enabled flush\n", drive->name);
+ }
+ drive->quiet = 0;
+ }
drive->revalidate = 1;
drive->suspend_reset = 0;
#ifdef CONFIG_PROC_FS
diff -urN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.4.20/fs/jbd/commit.c linux/fs/jbd/commit.c
--- /opt/kernel/linux-2.4.20/fs/jbd/commit.c 2002-11-29 00:53:15.000000000 +0100
+++ linux/fs/jbd/commit.c 2002-11-22 12:01:29.000000000 +0100
@@ -598,7 +598,15 @@
struct buffer_head *bh = jh2bh(descriptor);
clear_bit(BH_Dirty, &bh->b_state);
bh->b_end_io = journal_end_buffer_io_sync;
+
+ /* if we're on an ide device, setting BH_Ordered_Flush
+ will force a write cache flush before and after the
+ commit block. Otherwise, it'll do nothing. */
+
+ set_bit(BH_Ordered_Flush, &bh->b_state);
submit_bh(WRITE, bh);
+ clear_bit(BH_Ordered_Flush, &bh->b_state);
+
wait_on_buffer(bh);
put_bh(bh); /* One for getblk() */
journal_unlock_journal_head(descriptor);
diff -urN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.4.20/include/linux/blkdev.h linux/include/linux/blkdev.h
--- /opt/kernel/linux-2.4.20/include/linux/blkdev.h 2002-11-29 00:53:15.000000000 +0100
+++ linux/include/linux/blkdev.h 2002-11-26 17:33:57.000000000 +0100
@@ -32,6 +32,7 @@

kdev_t rq_dev;
int cmd; /* READ or WRITE */
+ unsigned long cmd_flags;
int errors;
unsigned long start_time;
unsigned long sector;
@@ -48,6 +49,10 @@
request_queue_t *q;
};

+#define RQ_WRITE_ORDERED 1 /* ordered write */
+#define RQ_WRITE_PREFLUSH 2 /* pre-barrier flush */
+#define RQ_WRITE_POSTFLUSH 4 /* post-barrier flush */
+
#include <linux/elevator.h>

typedef int (merge_request_fn) (request_queue_t *q,
@@ -127,6 +132,10 @@
char head_active;

unsigned long bounce_pfn;
+ /*
+ * ordered write support
+ */
+ char ordered;

/*
* Is meant to protect the queue in the future instead of
@@ -140,6 +149,9 @@
wait_queue_head_t wait_for_requests[2];
};

+#define QUEUE_ORDERED_NONE 0 /* no support */
+#define QUEUE_ORDERED_TAG 1 /* supported by tags (fast) */
+#define QUEUE_ORDERED_FLUSH 2 /* supported by cache flush (ugh!) */
extern unsigned long blk_max_low_pfn, blk_max_pfn;

#define BLK_BOUNCE_HIGH (blk_max_low_pfn << PAGE_SHIFT)
@@ -209,6 +221,7 @@
extern void blk_init_queue(request_queue_t *, request_fn_proc *);
extern void blk_cleanup_queue(request_queue_t *);
extern void blk_queue_headactive(request_queue_t *, int);
+extern void blk_queue_ordered(request_queue_t *, int);
extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
extern void generic_unplug_device(void *);
extern inline int blk_seg_merge_ok(struct buffer_head *, struct buffer_head *);
diff -urN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.4.20/include/linux/elevator.h linux/include/linux/elevator.h
--- /opt/kernel/linux-2.4.20/include/linux/elevator.h 2002-11-29 00:53:15.000000000 +0100
+++ linux/include/linux/elevator.h 2002-11-22 13:55:07.000000000 +0100
@@ -93,8 +93,8 @@

#define ELEVATOR_LINUS \
((elevator_t) { \
- 2048, /* read passovers */ \
- 8192, /* write passovers */ \
+ 256, /* read passovers */ \
+ 1024, /* write passovers */ \
\
elevator_linus_merge, /* elevator_merge_fn */ \
elevator_linus_merge_req, /* elevator_merge_req_fn */ \
diff -urN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.4.20/include/linux/fs.h linux/include/linux/fs.h
--- /opt/kernel/linux-2.4.20/include/linux/fs.h 2002-11-29 00:53:15.000000000 +0100
+++ linux/include/linux/fs.h 2002-11-22 11:30:56.000000000 +0100
@@ -220,6 +220,10 @@
BH_Wait_IO, /* 1 if we should write out this buffer */
BH_Launder, /* 1 if we can throttle on this buffer */
BH_JBD, /* 1 if it has an attached journal_head */
+ BH_Ordered_Tag, /* 1 if this buffer is a ordered write barrier */
+ BH_Ordered_Flush,/* 1 if this buffer is a flush write barrier */
+ BH_Ordered_Hard, /* 1 if barrier required by the caller */
+ BH_IO_OPNOTSUPP,/* 1 if block layer rejected a barrier write */

BH_PrivateStart,/* not a state bit, but the first bit available
* for private allocation by other entities
@@ -283,7 +287,10 @@
#define buffer_new(bh) __buffer_state(bh,New)
#define buffer_async(bh) __buffer_state(bh,Async)
#define buffer_launder(bh) __buffer_state(bh,Launder)
-
+#define buffer_ordered_tag(bh) __buffer_state(bh,Ordered_Tag)
+#define buffer_ordered_hard(bh) __buffer_state(bh,Ordered_Hard)
+#define buffer_ordered_flush(bh) __buffer_state(bh,Ordered_Flush)
+
#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)

extern void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset);
diff -urN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.4.20/include/linux/ide.h linux/include/linux/ide.h
--- /opt/kernel/linux-2.4.20/include/linux/ide.h 2002-11-29 00:53:15.000000000 +0100
+++ linux/include/linux/ide.h 2002-11-26 17:36:30.000000000 +0100
@@ -381,6 +381,8 @@
unsigned autotune : 2; /* 1=autotune, 2=noautotune, 0=default */
unsigned remap_0_to_1 : 2; /* 0=remap if ezdrive, 1=remap, 2=noremap */
unsigned ata_flash : 1; /* 1=present, 0=default */
+ unsigned quiet : 1;
+ unsigned doing_barrier : 1; /* barrier sequence in progress */
unsigned addressing; /* : 2; 0=28-bit, 1=48-bit, 2=64-bit */
byte scsi; /* 0=default, 1=skip current ide-subdriver for ide-scsi emulation */
byte media; /* disk, cdrom, tape, floppy, ... */
@@ -428,6 +430,7 @@
byte acoustic; /* acoustic management */
unsigned int failures; /* current failure count */
unsigned int max_failures; /* maximum allowed failure count */
+ char special_buf[4]; /* IDE_DRIVE_CMD, free use */
} ide_drive_t;

/*


--
Jens Axboe

2003-02-05 19:59:38

by Marc-Christian Petersen

Subject: Re: [PATCH] ide write barriers

On Wednesday 05 February 2003 17:33, Jens Axboe wrote:

Hi Jens,

> Sure, I had that one already. BTW, I discovered that the default io
thank you :)

> scheduler forgets to honor the cmd_flags, it's supposed to break like
> the noop does (see very first hunk in very first file). Must have
> removed that by mistake some time ago... This applies both to the
> 2.4.21-pre4 patch posted and this one.
well, I am impressed, really!

As you described in the patch:

+ * For journalled file systems, doing ordered writes on a commit
+ * block instead of explicitly doing wait_on_buffer (which is bad
+ * for performance) can be a big win. Block drivers supporting this
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I don't have benchmarks handy yet but as far as I can _feel_, this is a _MUST_
(I repeat: a _MUST_ for 2.4.21). And I am very good in feeling slowdowns for
interactivity :)

I am running it for quite some hours now with 2.4.20. Well, maybe the
nr_requests = 16 and read/write passovers changes in the elevator code give
us more smoothness than w/o but in my theoretical mind, this should drop
throughput. I also noticed, these changes aren't in your 2.4.21 patch. Can
you explain why it is in 2.4.20 patch or why it isn't in 2.4.21 patch ? :)

Thanks alot.

/ME calls out for Con to do a benchmark with the 2.4.21 patch.

ciao, Marc


2003-02-06 09:17:06

by Jens Axboe

Subject: Re: [PATCH] ide write barriers

On Wed, Feb 05 2003, Marc-Christian Petersen wrote:
> On Wednesday 05 February 2003 17:33, Jens Axboe wrote:
>
> Hi Jens,
>
> > Sure, I had that one already. BTW, I discovered that the default io
> thank you :)
>
> > scheduler forgets to honor the cmd_flags, it's supposed to break like
> > the noop does (see very first hunk in very first file). Must have
> > removed that by mistake some time ago... This applies both to the
> > 2.4.21-pre4 patch posted and this one.
> well, I am impressed, really!
>
> As you described in the patch:
>
> + * For journalled file systems, doing ordered writes on a commit
> + * block instead of explicitly doing wait_on_buffer (which is bad
> + * for performance) can be a big win. Block drivers supporting this
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> I don't have benchmarks handy yet but as far as I can _feel_, this is
> a _MUST_ (I repeat: a _MUST_ for 2.4.21). And I am very good in
> feeling slowdowns for interactivity :)
>
> I am running it for quite some hours now with 2.4.20. Well, maybe the
> nr_requests = 16 and read/write passovers changes in the elevator code give
> us more smoothness than w/o but in my theoretical mind, this should drop
> throughput. I also noticed, these changes aren't in your 2.4.21 patch. Can
> you explain why it is in 2.4.20 patch or why it isn't in 2.4.21 patch ? :)

They are not in the 2.4.21-pre patch, because the 2.4.20 one got mixed
with another patch. The limiting of the number of queue requests is a
different patch, and it's quite likely it will provide better
interactiveness for you. But it will also considerably drop throughput;
it depends on the workload whether that's the case or not.

I don't think ext3 will show a performance boost with the supplied
patch, but the stuff for reiserfs that Chris did definitely will.

--
Jens Axboe

2003-02-26 20:21:25

by Scott Lee

[permalink] [raw]
Subject: Re: [PATCH] ide write barriers

> The goal is to make the use of write
> back cache enabled ide drives safe with journalled file systems.

Does this mean that having write caching enabled is not safe if you are using ext3 on an IDE drive? Should "hdparm -W 0 /dev/hda" be used, for example? (I see a 50% performance hit using "-W 0" when my box is under load.) If this is the case, what is the root cause? Do IDE drives reorder writes when they are cached?

Thank you in advance for any guidance.

Scott Lee

2003-02-26 20:43:32

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] ide write barriers

On Wed, Feb 26 2003, Scott Lee wrote:
> > The goal is to make the use of write
> > back cache enabled ide drives safe with journalled file systems.
>
> Does this mean that having write caching enabled is not safe if you
> are using ext3 on an IDE drive? Should "hdparm -W 0 /dev/hda" be used
> for example. (I see a 50% performance hit using "-W 0" when my box is
> under load.) If this is the case, what is the root cause? Do IDE
> drives reorder writes when they are cached?

As it stands, it's not safe to use write back caching on IDE drives.
When the write completes as seen from the fs, it's not on the platter
yet. That's a problem. And as you mention, there's no guarantee that
writes won't be reordered as well.

So yes, either use the barrier patch or disable write caching.

--
Jens Axboe

2003-02-26 21:20

by Scott Lee

[permalink] [raw]
Subject: RE: [PATCH] ide write barriers

> On Wed, Feb 26 2003, Scott Lee wrote:
> > > The goal is to make the use of write
> > > back cache enabled ide drives safe with journalled file systems.
> >
> > Does this mean that having write caching enabled is not safe if you
> > are using ext3 on an IDE drive? Should "hdparm -W 0 /dev/hda" be used
> > for example. (I see a 50% performance hit using "-W 0" when my box is
> > under load.) If this is the case, what is the root cause? Do IDE
> > drives reorder writes when they are cached?
>
> As it stands, it's not safe to use write back caching on IDE drives.
> When the write completes as seen from the fs, it's not on the platter
> yet. That's a problem. And as you mention, there's no guarentee that
> writes won't be reordered as well.
>
> So yes, either use the barrier patch or disable write caching.

Unless I'm missing something, the effect of write caching will be nil from a
safety standpoint *if* the drive does *not* reorder writes. If cached
writes are written to the platter in order, it seems that a loss of power
will simply mean that, from the platter's perspective, the system will look
like the power was lost at "T-x" rather than "T", where "T" is the actual
time of the outage and "x" is the age of the oldest piece of cached data.
The net effect is that the filesystem should be in no worse shape than if
there were no caching and the power actually went out at "T-x". (Unless of
course I am missing something here.)

If the cached writes can be reordered then of course it stands that caching
would be unsafe. Does anyone know if IDE drives do this? I'm certainly no
expert in this area but I thought only SCSI drives reordered operations.

Also, do you happen to know if the barrier patch will apply cleanly to a RH
2.4.18-rc1-ac2 kernel and function properly?

Regards,
Scott Lee

2003-02-26 21:58:01

by Alan

[permalink] [raw]
Subject: RE: [PATCH] ide write barriers

On Wed, 2003-02-26 at 21:20, LEE,SCOTT (HP-Roseville,ex1) wrote:
> If the cached writes can be reordered then of course it stands that caching
> would be unsafe. Does anyone know if IDE drives do this? I'm certainly no
> expert in this area but I thought only SCSI drives reordered operations.

Undefined. There certainly seem to be drives whose physical sector size
exceeds the one they present to the world. A modern disk is all smoke and
mirrors, compatibility hack on top of compatibility hack.

I can't find anything in the spec that requires that drives do not reorder.
Nor can I find anything in the spec requiring that a drive implement the
cache flush as anything except a nop [I believe ATA-6 changed that for the
very newest stuff]. Not all drives appear to honour any kind of cache
disable either.

2003-02-27 22:15:55

by Andre Hedrick

[permalink] [raw]
Subject: Re: [PATCH] ide write barriers

On Wed, 26 Feb 2003, Scott Lee wrote:

> > The goal is to make the use of write
> > back cache enabled ide drives safe with journalled file systems.
>
> Does this mean that having write caching enabled is not safe if you are
> using ext3 on an IDE drive? Should "hdparm -W 0 /dev/hda" be used for
> example. (I see a 50% performance hit using "-W 0" when my box is under
> load.) If this is the case, what is the root cause? Do IDE drives
> reorder writes when they are cached?

All drives reorder unless instructed not to do so.
ATA drives are just now getting FUA (Force Unit Access) hooks.
So flush the cache often.

Cheers,

Andre Hedrick
LAD Storage Consulting Group