2002-02-13 12:52:23

by Jens Axboe

[permalink] [raw]
Subject: [PATCH] queue barrier support

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.298 -> 1.299
# fs/reiserfs/journal.c 1.27 -> 1.28
# fs/reiserfs/super.c 1.28 -> 1.29
# include/linux/reiserfs_fs_sb.h 1.9 -> 1.10
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/02/13 [email protected] 1.299
# reiserfs barrier write support
# --------------------------------------------
#
diff -Nru a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Wed Feb 13 13:50:34 2002
+++ b/fs/reiserfs/journal.c Wed Feb 13 13:50:34 2002
@@ -115,6 +115,21 @@
return 0 ;
}

+static int barrier_wait_on_buffer(struct super_block *s, struct buffer_head *bh)
+{
+ int ordered = test_and_clear_bit(BH_Ordered, &bh->b_state) ;
+ if (!reiserfs_barrier(s) || ordered) {
+ wait_on_buffer(bh) ;
+ }
+ return 0 ;
+}
+
+static void mark_barrier(struct super_block *s, struct buffer_head *bh)
+{
+ if (reiserfs_barrier(s))
+ set_bit(BH_Ordered, &bh->b_state);
+}
+
static struct reiserfs_bitmap_node *
allocate_bitmap_node(struct super_block *p_s_sb) {
struct reiserfs_bitmap_node *bn ;
@@ -721,7 +736,7 @@
bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start + i) % SB_ONDISK_JOURNAL_SIZE(s) ;
tbh = get_hash_table(SB_JOURNAL_DEV(s), bn, s->s_blocksize) ;

- wait_on_buffer(tbh) ;
+ barrier_wait_on_buffer(s, tbh) ;
if (!buffer_uptodate(tbh)) {
reiserfs_panic(s, "journal-601, buffer write failed\n") ;
}
@@ -741,9 +756,10 @@
atomic_read(&(jl->j_commit_left)));
}

+ mark_barrier(s, jl->j_commit_bh);
mark_buffer_dirty(jl->j_commit_bh) ;
ll_rw_block(WRITE, 1, &(jl->j_commit_bh)) ;
- wait_on_buffer(jl->j_commit_bh) ;
+ barrier_wait_on_buffer(s, jl->j_commit_bh) ;
if (!buffer_uptodate(jl->j_commit_bh)) {
reiserfs_panic(s, "journal-615: buffer write failed\n") ;
}
@@ -835,8 +851,9 @@
jh->j_first_unflushed_offset = cpu_to_le32(offset) ;
jh->j_mount_id = cpu_to_le32(SB_JOURNAL(p_s_sb)->j_mount_id) ;
set_bit(BH_Dirty, &(SB_JOURNAL(p_s_sb)->j_header_bh->b_state)) ;
+ mark_barrier(p_s_sb, SB_JOURNAL(p_s_sb)->j_header_bh);
ll_rw_block(WRITE, 1, &(SB_JOURNAL(p_s_sb)->j_header_bh)) ;
- wait_on_buffer((SB_JOURNAL(p_s_sb)->j_header_bh)) ;
+ barrier_wait_on_buffer(p_s_sb, (SB_JOURNAL(p_s_sb)->j_header_bh)) ;
if (!buffer_uptodate(SB_JOURNAL(p_s_sb)->j_header_bh)) {
printk( "reiserfs: journal-837: IO error during journal replay\n" );
return -EIO ;
@@ -1057,7 +1074,7 @@
if (!cn->bh) {
reiserfs_panic(s, "journal-1011: cn->bh is NULL\n") ;
}
- wait_on_buffer(cn->bh) ;
+ barrier_wait_on_buffer(s, cn->bh) ;
if (!cn->bh) {
reiserfs_panic(s, "journal-1012: cn->bh is NULL\n") ;
}
@@ -1163,7 +1180,7 @@
} else if (test_bit(BLOCK_NEEDS_FLUSH, &cn->state)) {
clear_bit(BLOCK_NEEDS_FLUSH, &cn->state) ;
if (!pjl && cn->bh) {
- wait_on_buffer(cn->bh) ;
+ barrier_wait_on_buffer(s, cn->bh) ;
}
/* check again, someone could have logged while we scheduled */
pjl = find_newer_jl_for_cn(cn) ;
diff -Nru a/fs/reiserfs/super.c b/fs/reiserfs/super.c
--- a/fs/reiserfs/super.c Wed Feb 13 13:50:34 2002
+++ b/fs/reiserfs/super.c Wed Feb 13 13:50:34 2002
@@ -380,6 +380,9 @@
reiserfs_prepare_for_journal(s, SB_BUFFER_WITH_SB(s), 1) ;
set_sb_umount_state( SB_DISK_SUPER_BLOCK(s), s->u.reiserfs_sb.s_mount_state );
journal_mark_dirty(&th, s, SB_BUFFER_WITH_SB (s));
+
+ /* force full waits when unmounting */
+ s->u.reiserfs_sb.s_mount_opt &= ~(1 << REISERFS_BARRIER) ;
}

/* note, journal_release checks for readonly mount, and can decide not
@@ -528,6 +531,9 @@
set_bit (REISERFS_NO_UNHASHED_RELOCATION, mount_options);
} else if (!strcmp (this_char, "hashed_relocation")) {
set_bit (REISERFS_HASHED_RELOCATION, mount_options);
+ } else if (!strcmp (this_char, "barrier")) {
+ set_bit (REISERFS_BARRIER, mount_options);
+ printk("reiserfs: enabling write barrier code\n") ;
} else if (!strcmp (this_char, "test4")) {
set_bit (REISERFS_TEST4, mount_options);
} else if (!strcmp (this_char, "nolog")) {
diff -Nru a/include/linux/reiserfs_fs_sb.h b/include/linux/reiserfs_fs_sb.h
--- a/include/linux/reiserfs_fs_sb.h Wed Feb 13 13:50:34 2002
+++ b/include/linux/reiserfs_fs_sb.h Wed Feb 13 13:50:34 2002
@@ -405,12 +405,8 @@
#define REISERFS_NO_BORDER 11
#define REISERFS_NO_UNHASHED_RELOCATION 12
#define REISERFS_HASHED_RELOCATION 13
-#define REISERFS_TEST4 14
-
-#define REISERFS_TEST1 11
-#define REISERFS_TEST2 12
-#define REISERFS_TEST3 13
-#define REISERFS_TEST4 14
+#define REISERFS_BARRIER 14
+#define REISERFS_TEST4 15

#define reiserfs_r5_hash(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << FORCE_R5_HASH))
#define reiserfs_rupasov_hash(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << FORCE_RUPASOV_HASH))
@@ -419,6 +415,7 @@
#define reiserfs_no_border(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_NO_BORDER))
#define reiserfs_no_unhashed_relocation(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_NO_UNHASHED_RELOCATION))
#define reiserfs_hashed_relocation(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_HASHED_RELOCATION))
+#define reiserfs_barrier(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_BARRIER))
#define reiserfs_test4(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_TEST4))

#define dont_have_tails(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << NOTAIL))


Attachments:
(No filename) (1.75 kB)
barrier-1-c1.295 (5.56 kB)
ide-barrier-1-c1.296 (4.15 kB)
scsi-barrier-1-c1.297 (1.85 kB)
aic7xxx-barrier-1-c1.298 (1.58 kB)
reiser-barrier-1-c1.299 (5.62 kB)
Download all attachments

2002-02-13 13:09:40

by Martin Dalecki

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

Jens Axboe wrote:

>Patches attached, comments welcome.
>
>diff -Nru a/include/linux/ide.h b/include/linux/ide.h
>--- a/include/linux/ide.h Wed Feb 13 13:48:25 2002
>+++ b/include/linux/ide.h Wed Feb 13 13:48:25 2002
>@@ -448,6 +448,7 @@
> byte acoustic; /* acoustic management */
> unsigned int failures; /* current failure count */
> unsigned int max_failures; /* maximum allowed failure count */
>+ char special_buf[4]; /* IDE_DRIVE_CMD, free use */
> } ide_drive_t;
>
ide-barrier-1-c1.296:+ memset(drive->special_buf, 0,
sizeof(drive->special_buf));
ide-barrier-1-c1.296:+ flush_rq->buffer = drive->special_buf;
ide-barrier-1-c1.296:+ char special_buf[4]; /*
IDE_DRIVE_CMD, free use */

I just don't see special_buf used anywhere. What is it supposed to be
used for?
Is the intention to make it look like the SCSI code?

>


2002-02-13 13:13:42

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On Wed, Feb 13 2002, Martin Dalecki wrote:
> Jens Axboe wrote:
>
> >Patches attached, comments welcome.
> >
> >diff -Nru a/include/linux/ide.h b/include/linux/ide.h
> >--- a/include/linux/ide.h Wed Feb 13 13:48:25 2002
> >+++ b/include/linux/ide.h Wed Feb 13 13:48:25 2002
> >@@ -448,6 +448,7 @@
> > byte acoustic; /* acoustic management */
> > unsigned int failures; /* current failure count */
> > unsigned int max_failures; /* maximum allowed failure count */
> >+ char special_buf[4]; /* IDE_DRIVE_CMD, free use */
> >} ide_drive_t;
> >
> ide-barrier-1-c1.296:+ memset(drive->special_buf, 0,
> sizeof(drive->special_buf));
> ide-barrier-1-c1.296:+ flush_rq->buffer = drive->special_buf;
> ide-barrier-1-c1.296:+ char special_buf[4]; /*
> IDE_DRIVE_CMD, free use */
>
> I just don't see special_buf used anywhere. What is it supposed to be
> used for?
> Is the intention to make it look like the SCSI code?

See ide.c:ide_queue_flush_cmd(), I'm only using 1 of the bytes but it
has room for 4 like that is typically used when issuing an ata command
directly. So yes, it is used, I'm not adding stuff for fun :-)

--
Jens Axboe

2002-02-13 14:37:34

by Martin Dalecki

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

Jens Axboe wrote:

>On Wed, Feb 13 2002, Martin Dalecki wrote:
>
>>Jens Axboe wrote:
>>
>>>Patches attached, comments welcome.
>>>
>>>diff -Nru a/include/linux/ide.h b/include/linux/ide.h
>>>--- a/include/linux/ide.h Wed Feb 13 13:48:25 2002
>>>+++ b/include/linux/ide.h Wed Feb 13 13:48:25 2002
>>>@@ -448,6 +448,7 @@
>>> byte acoustic; /* acoustic management */
>>> unsigned int failures; /* current failure count */
>>> unsigned int max_failures; /* maximum allowed failure count */
>>>+ char special_buf[4]; /* IDE_DRIVE_CMD, free use */
>>>} ide_drive_t;
>>>
>>ide-barrier-1-c1.296:+ memset(drive->special_buf, 0,
>>sizeof(drive->special_buf));
>>ide-barrier-1-c1.296:+ flush_rq->buffer = drive->special_buf;
>>ide-barrier-1-c1.296:+ char special_buf[4]; /*
>>IDE_DRIVE_CMD, free use */
>>
>>I just don't see special_buf used anywhere. What is it supposed to be
>>used for?
>>Is the intention to make it look like the SCSI code?
>>
>
>See ide.c:ide_queue_flush_cmd(), I'm only using 1 of the bytes but it
>has room for 4 like that is typically used when issuing an ata command
>directly. So yes, it is used, I'm not adding stuff for fun :-)
>

OK I see now. Is this to become analogous to the sr_cmnd field for
struct scsi_request?
It would make sense to first make them use the same software
architecture at least and then
to merge as much of this driver mid-layer stuff as possible....


2002-02-13 14:42:05

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On Wed, Feb 13 2002, Martin Dalecki wrote:
> >On Wed, Feb 13 2002, Martin Dalecki wrote:
> >
> >>Jens Axboe wrote:
> >>
> >>>Patches attached, comments welcome.
> >>>
> >>>diff -Nru a/include/linux/ide.h b/include/linux/ide.h
> >>>--- a/include/linux/ide.h Wed Feb 13 13:48:25 2002
> >>>+++ b/include/linux/ide.h Wed Feb 13 13:48:25 2002
> >>>@@ -448,6 +448,7 @@
> >>> byte acoustic; /* acoustic management */
> >>> unsigned int failures; /* current failure count */
> >>> unsigned int max_failures; /* maximum allowed failure count */
> >>>+ char special_buf[4]; /* IDE_DRIVE_CMD, free use */
> >>>} ide_drive_t;
> >>>
> >>ide-barrier-1-c1.296:+ memset(drive->special_buf, 0,
> >>sizeof(drive->special_buf));
> >>ide-barrier-1-c1.296:+ flush_rq->buffer = drive->special_buf;
> >>ide-barrier-1-c1.296:+ char special_buf[4]; /*
> >>IDE_DRIVE_CMD, free use */
> >>
> >>I just don't see special_buf used anywhere. What is it supposed to be
> >>used for?
> >>Is the intention to make it look like the SCSI code?
> >>
> >
> >See ide.c:ide_queue_flush_cmd(), I'm only using 1 of the bytes but it
> >has room for 4 like that is typically used when issuing an ata command
> >directly. So yes, it is used, I'm not adding stuff for fun :-)
> >
>
> OK I see now. Is this to become analogous to the sr_cmnd field for
> struct scsi_request?
> It would make sense to first make them use the same software
> architecture at least and then
> to merge as much of this driver mid-layer stuff as possible....

Well dunno, typically users of IDE_DRIVE_CMD's are using ide_wait and
thus waits for completion. So request and arguments are almost always
allocated on the stack, which is fine in that case. For out-of-order
execution like the pre and post flushes in the barrier patch, there just
needed to be some bytes available for the data command.

The SCSI sr_cmnd[] cdb is different from the ATA register set, but in
fact in 2.5 the IDE layer could use the cdb in struct request as a temp
buffer for stuff like this. It's exactly what the ide-barrier patch
above could use. So yeah I could clean special_buf[] and just use

rq->buffer = rq->cmd;

instead.

--
Jens Axboe

2002-02-13 14:47:15

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On February 13, 2002 01:51 pm, Jens Axboe wrote:
> Patches attached, comments welcome.

A meta-comment: the BK url's are wonderfully informative and useful, but they
are long and _ugly_! Is there anything that can be done about that?

--
Daniel

2002-02-13 15:19:06

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On Wed, Feb 13 2002, Daniel Phillips wrote:
> On February 13, 2002 01:51 pm, Jens Axboe wrote:
> > Patches attached, comments welcome.
>
> A meta-comment: the BK url's are wonderfully informative and useful, but they
> are long and _ugly_! Is there anything that can be done about that?

Yeah I like them too, maybe if I just figured out how to get BitKeeper
to dump full changeset info I could just inline them in the mail
instead. I'll look at up and try that next time.

--
Jens Axboe

2002-02-13 17:49:23

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On Feb 13, 2002 16:18 +0100, Jens Axboe wrote:
> On Wed, Feb 13 2002, Daniel Phillips wrote:
> > On February 13, 2002 01:51 pm, Jens Axboe wrote:
> > > Patches attached, comments welcome.
> >
> > A meta-comment: the BK url's are wonderfully informative and useful, but
> > they are long and _ugly_! Is there anything that can be done about that?
>
> Yeah I like them too, maybe if I just figured out how to get BitKeeper
> to dump full changeset info I could just inline them in the mail
> instead. I'll look at up and try that next time.

bk send -wgzip_uu -r<rev> - > foo-<rev>.bk

This will dump a gzipped-uuencoded changset to the file. The receiver
just do "| bk receive [repository] -avv" to import it on the other end.

My preferred format for sending BK CSETs is below. The gzip_uu CSET
data only adds maybe 10% for large patches, and about doubles the size
of very small patches. I also created a bz64 (bzip2 + base64) wrapper
which makes the CSET data smaller, but that is only useful if other BK
developers have this wrapper also.

Cheers, Andreas
=============================== bksend ====================================
#!/bin/sh
# A script to format BK changeset output in a manner that is easy to read.
# Andreas Dilger <[email protected]> 13/02/2002

PROG=bksend

usage() {
echo "usage: $PROG -r<rev>"
echo -e "\twhere <rev> is of the form '1.23', '1.23..', '1.23..1.27',"
echo -e "\tor '+' to indicate the most recent revision"

exit 1
}

case $1 in
-r) REV=$2; shift ;;
-r*) REV=`echo $1 | sed 's/^-r//'` ;;
*) echo "$PROG: no revision given, you probably don't want that";;
esac

[ -z "$REV" ] && usage

bk changes -r$REV
bk export -tpatch -du -h -r$REV
echo -e "\n================================================================\n\n"
bk send -wgzip_uu -r$REV -
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-02-13 18:26:44

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

[email protected] said:
> [email protected], 2002-02-13 13:42:39+01:00, [email protected]
> Add support for SCSI drivers to indicate support for ordered tags
> http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/
> torvalds-2002020517305 \ 6-16047-c1d11a41ed024864/[email protected]?nav=
> index.html|ChangeSet@-1h

> [email protected], 2002-02-13 13:43:04+01:00, [email protected]
> Add ordered tag support to the aic7xxx scsi driver

The rest of the aic7xxx code uses MSG_ORDERED_TASK rather than
MSG_ORDERED_Q_TAG. You have to scan through the headers to see that these are
#defined the same.

A problem (that is probably only an issue for older drives) is that while
technically the standard requires all 3 types of TAG to be supported if tag
queueing is, some drives really only have simple tag support in their
firmware, so you may need to add a blacklist for ordered tags on certain
drives.

A further issue is that you haven't added anything to the error recovery code
for this. If error recovery is activated for the device at the reset level,
all tags will be discarded by the device. The eh will retry the failing
command and then the other tagged commands will be re-issued from the
scsi_bottom_half_handler (assuming the low level device driver immediately
fails them with DID_RESET) in the order in which the low level driver failed
them. Thus you have potentially completely messed up the ordering when the
commands all get retried.

James Bottomley



2002-02-15 09:03:07

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On Wed, Feb 13 2002, James Bottomley wrote:
> [email protected] said:
> > [email protected], 2002-02-13 13:42:39+01:00, [email protected]
> > Add support for SCSI drivers to indicate support for ordered tags
> > http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/
> > torvalds-2002020517305 \ 6-16047-c1d11a41ed024864/[email protected]?nav=
> > index.html|ChangeSet@-1h
>
> > [email protected], 2002-02-13 13:43:04+01:00, [email protected]
> > Add ordered tag support to the aic7xxx scsi driver
>
> The rest of the aic7xxx code uses MSG_ORDERED_TASK rather than
> MSG_ORDERED_Q_TAG. You have to scan through the headers to see that
> these are #defined the same.

Ah, the MSG_ORDERED_TASK did probably not exist when I originally did
the patch (a year or so ago, iirc). I'll change the define to follow the
rest of the code.

> A problem (that is probably only an issue for older drives) is that
> while technically the standard requires all 3 types of TAG to be
> supported if tag queueing is, some drives really only have simple tag
> support in their firmware, so you may need to add a blacklist for
> ordered tags on certain drives.

Yes you are right.

> A further issue is that you haven't added anything to the error
> recovery code for this. If error recovery is activated for the device
> at the reset level, all tags will be discarded by the device. The eh
> will retry the failing command and then the other tagged commands will
> be re-issued from the scsi_bottom_half_handler (assuming the low level
> device driver immediately fails them with DID_RESET) in the order in
> which the low level driver failed them. Thus you have potentially
> completely messed up the ordering when the commands all get retried.

I already have this fixed locally by just maintaining a fifo list of
queued commands so we can retry them in the correct order. For 2.5
there's a busy and free list (so no more scanning for a free command
either), I would imagine that the easiest for 2.4 is to just maintain a
busy list in addition to the current array of pending/free commands.

--
Jens Axboe

2002-02-15 13:42:38

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support



On Wednesday, February 13, 2002 01:26:12 PM -0500 James Bottomley <[email protected]> wrote:

> [email protected] said:
>> [email protected], 2002-02-13 13:42:39+01:00, [email protected]
>> Add support for SCSI drivers to indicate support for ordered tags
>> http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/
>> torvalds-2002020517305 \ 6-16047-c1d11a41ed024864/[email protected]?nav=
>> index.html|ChangeSet@-1h
>
>> [email protected], 2002-02-13 13:43:04+01:00, [email protected]
>> Add ordered tag support to the aic7xxx scsi driver
>
> The rest of the aic7xxx code uses MSG_ORDERED_TASK rather than
> MSG_ORDERED_Q_TAG. You have to scan through the headers to see that these are
># defined the same.

Jens, my patch from yesterday has this fixed.

>
> A problem (that is probably only an issue for older drives) is that while
> technically the standard requires all 3 types of TAG to be supported if tag
> queueing is, some drives really only have simple tag support in their
> firmware, so you may need to add a blacklist for ordered tags on certain
> drives.

Yes, this could get sticky. Does anyone know if other OSes have already
done this?

>
> A further issue is that you haven't added anything to the error recovery code
> for this. If error recovery is activated for the device at the reset level,
> all tags will be discarded by the device. The eh will retry the failing
> command and then the other tagged commands will be re-issued from the
> scsi_bottom_half_handler (assuming the low level device driver immediately
> fails them with DID_RESET) in the order in which the low level driver failed
> them. Thus you have potentially completely messed up the ordering when the
> commands all get retried.

I was wondering about this, we would need to change the error handler to
fail all the requests after the barrier. I was hoping the driver
did this for us ;-)

-chris

2002-02-15 15:16:23

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

[email protected] said:
> A further issue is that you haven't added anything to the error
> recovery code for this. If error recovery is activated for the device
> at the reset level, all tags will be discarded by the device. The eh
> will retry the failing command and then the other tagged commands will
> be re-issued from the scsi_bottom_half_handler (assuming the low level
> device driver immediately fails them with DID_RESET) in the order in
> which the low level driver failed them. Thus you have potentially
> completely messed up the ordering when the commands all get retried.

[email protected] said:
> I already have this fixed locally by just maintaining a fifo list of
> queued commands so we can retry them in the correct order. For 2.5
> there's a busy and free list (so no more scanning for a free command
> either), I would imagine that the easiest for 2.4 is to just maintain
> a busy list in addition to the current array of pending/free commands.

[email protected] said:
> I was wondering about this, we would need to change the error handler
> to fail all the requests after the barrier. I was hoping the driver
> did this for us ;-)

Unfortunately, this is going to involve deep hackery inside the error handler.
The current initial premise is that it can simply retry the failing command
by issuing an ABORT to the tag and resending it (which can cause a tag to move
past your barrier). In an error situation, it really wouldn't be wise to try
to abort lots of potentially running tags to preserve the barrier ordering
(because of the overload placed on a known failing component), so I think the
error handler has to abandon the concept of aborting commands and move
straight to device reset. We then carefully resend the commands in FIFO order.

Additionally, you must handle the case that a device is reset by something
else (in error handler terms, the cc_ua [check condition/unit attention]).
Here also, the tags would have to be sent back down in FIFO order as soon as
the condition is detected.

[email protected] said:
> Yes, this could get sticky. Does anyone know if other OSes have
> already done this?

Other OSs (well the ones I've heard about: Solaris and HP-UX) try to avoid
ordered tags, mainly because of the performance impact they have---the drive
tag service algorithms become inefficient in the presence of ordered tags
since they're usually optimised for all simple tags.

James


2002-02-15 16:29:21

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support


On Friday, February 15, 2002 10:15:58 AM -0500 James Bottomley <[email protected]> wrote:

> [email protected] said:
>> I was wondering about this, we would need to change the error handler
>> to fail all the requests after the barrier. I was hoping the driver
>> did this for us ;-)
>
> Unfortunately, this is going to involve deep hackery inside the error handler.
> The current initial premise is that it can simply retry the failing command
> by issuing an ABORT to the tag and resending it (which can cause a tag to move
> past your barrier). In an error situation, it really wouldn't be wise to try
> to abort lots of potentially running tags to preserve the barrier ordering
> (because of the overload placed on a known failing component), so I think the
> error handler has to abandon the concept of aborting commands and move
> straight to device reset. We then carefully resend the commands in FIFO order.
>

Ok, I'll try to narrow the barrier usage a bit, I'm waiting on the
barrier write once it is sent, so I'm not worried about anything done
after the ordered tag.

write X log blocks (simple tag)
write 1 commit block (ordered tag)
wait on all of them.

All I care about is knowing that all of the log blocks hit the disk
before the commit. So, if one of the log blocks aborts, I want it
to abort the commit too. Is this a little easier to implement?

> Additionally, you must handle the case that a device is reset by something
> else (in error handler terms, the cc_ua [check condition/unit attention]).
> Here also, the tags would have to be sent back down in FIFO order as soon as
> the condition is detected.
>
> [email protected] said:
>> Yes, this could get sticky. Does anyone know if other OSes have
>> already done this?
>
> Other OSs (well the ones I've heard about: Solaris and HP-UX) try to avoid
> ordered tags, mainly because of the performance impact they have---the drive
> tag service algorithms become inefficient in the presence of ordered tags
> since they're usually optimised for all simple tags.

I do see that on my drives ;-) The main reason I think its worth trying
is because the commit block is directly adjacent to the last log block,
so I'm hoping the drive can optimize the commit (even though it is ordered)
better when the OS sends it directly after the last log block.

While I've got linux-scsi cc'd, I'll reask a question from yesterday.
Do the targets with write back caches usually ignore the order tag,
doing the write in the most efficient way possible instead?

-chris



2002-02-15 16:44:31

by Mike Anderson

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

James Bottomley [[email protected]] wrote:
> [email protected] said:
>
> Unfortunately, this is going to involve deep hackery inside the error handler.
> The current initial premise is that it can simply retry the failing command
> by issuing an ABORT to the tag and resending it (which can cause a tag to move
> past your barrier). In an error situation, it really wouldn't be wise to try
> to abort lots of potentially running tags to preserve the barrier ordering
> (because of the overload placed on a known failing component), so I think the
> error handler has to abandon the concept of aborting commands and move
> straight to device reset. We then carefully resend the commands in FIFO order.
>
> Additionally, you must handle the case that a device is reset by something
> else (in error handler terms, the cc_ua [check condition/unit attention]).
> Here also, the tags would have to be sent back down in FIFO order as soon as
> the condition is detected.

I agree on the hacker. You also will need to clean the pipe of the unit
attention post the reset or the first command you send down will
be coming right back into the error handler and could get you in to a
error recovery storm.


>
> James
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Michael Anderson
[email protected]

2002-02-15 16:52:11

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

[email protected] said:
> While I've got linux-scsi cc'd, I'll reask a question from yesterday.
> Do the targets with write back caches usually ignore the order tag,
> doing the write in the most efficient way possible instead?

I'll try to answer, although I'm not quite sure of the basis of the question.

No ordinary SCSI drive that's not on a battery backed circuit should *ever*
use writeback caching. They should all come by default as write through. In
this case, the tag is acknowleged as completed only when the write has made it
to the medium.

If you alter the drive parameter page to turn on write back caching, it's
caveat emptor. Since you're now telling the drive that you consider hitting
the cache to be equivalent to hitting the medium (because you know better
about power failures and the like) it is undefined by the standards how the
drives write from the cache to the medium---and you shouldn't care about this
if you have arranged your system to do write back caching. They will
acknowlege the tag as completed as soon as it hits the cache, and the ordered
tag will be order it commits to the cache.

Note, for high end disk arrays, the reverse is usually true since they often
have internal battery backups for their caches and drives.

If you lose power to the drive that does write back caching before the cache
flushes, all the indications you've given the journalling fs about write
commits are voided and the fs is probably inconsistent and not recoverable by
a journal replay.

Note also that on system shutdown, most devices that use write back caching
are also expecting a cache flush instruction from the node, which Linux
doesn't send.

James


2002-02-15 17:09:51

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

[email protected] said:
> Ok, I'll try to narrow the barrier usage a bit, I'm waiting on the
> barrier write once it is sent, so I'm not worried about anything done
> after the ordered tag.

> write X log blocks (simple tag) write 1 commit block (ordered tag)
> wait on all of them.

> All I care about is knowing that all of the log blocks hit the disk
> before the commit. So, if one of the log blocks aborts, I want it to
> abort the commit too. Is this a little easier to implement?

Actually, I'm afraid not. The FS has the notion of a transaction to help it
with this. All the SCSI driver sees is a list of IOs with some requests for
ordered tags. If we go into error recovery and have to abort a tag, how do we
know which of the outstanding ordered tags is actually the commit block for
this transaction? (Remember also that a FS sits on a partition, not a physical
SCSI device, so even if you single thread your commits per FS, there could
still be multiple FSs on the same physical device).

That's why I think it's safer to abandon the abort entirely. If our first
line of error recovery is device reset, we know we've just cleared the entire
outstanding tag queue and we can resend all the tags in FIFO order which means
that we still don't need a concept of "transaction" at the SCSI level (which
is a good thing, I think), all we need to know is the order in which the tags
came to us.

Even if some tags were committed but not acknowledged at reset time, it's
still safe to resend them in the correct ordered sequence because of
transaction idempotence of block writes.

James


2002-02-15 17:18:31

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support



On Friday, February 15, 2002 11:51:45 AM -0500 James Bottomley <[email protected]> wrote:

> [email protected] said:
>> While I've got linux-scsi cc'd, I'll reask a question from yesterday.
>> Do the targets with write back caches usually ignore the order tag,
>> doing the write in the most efficient way possible instead?
>
> I'll try to answer, although I'm not quite sure of the basis of the question.
>
> No ordinary SCSI drive that's not on a battery backed circuit should *ever*
> use writeback caching. They should all come by default as write through. In
> this case, the tag is acknowleged as completed only when the write has made it
> to the medium.

Right, friends don't let friends use non-battery back write back caches ;-)

>
> If you alter the drive parameter page to turn on write back caching, it's
> caveat emptor. Since you're now telling the drive that you consider hitting
> the cache to be equivalent to hitting the medium (because you know better
> about power failures and the like) it is undefined by the standards how the
> drives write from the cache to the medium---and you shouldn't care about this
> if you have arranged your system to do write back caching. They will
> acknowlege the tag as completed as soon as it hits the cache, and the ordered
> tag will be order it commits to the cache.

Ok, this is what I was expecting. The performance improvement from
using the ordered tags on single drives doesn't seem to be very big, and
given the problems with failed requests, I'm thinking about dropping the
scsi parts of the 2.4 barrier patch, and just worrying about making ide
drives flush things correctly. The hard stuff on error recovery can
be tackled in 2.5 and (maybe) ported back later.

But, if high end controllers with write back caching see improvements
because they can combine the log writes better with the patch, it might
be worth keeping, but modified in reiserfs so it flushes IDE caches
by default, but only sends ordered tags when the admin mounts
with -o barrier_ordered or something.

thanks for the info
-chris

2002-02-15 17:49:16

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

[email protected] said:
> I'm thinking about dropping the scsi parts of the 2.4 barrier patch,
> and just worrying about making ide drives flush things correctly.
> The hard stuff on error recovery can be tackled in 2.5 and (maybe)
> ported back later.

That's probably best. I do agree that 2.5 is the place to play around with
SCSI error handling first.

I'm willing to help re-do the error handler, since I've always thought that
abort isn't a good first line of defence because it actually adds to the
command burden of a failing drive.

Jens, if you want to share the code you already have (or point me to the
bitkeeper repository where you keep it) I'll look it over.

As far as the back-port to 2.4, as long as you only support SCSI drivers that
use the new error handler (so we don't have to worry about the obsolete one)
that should be reasonably easy.

James


2002-02-15 22:32:09

by Matthias Andree

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On Fri, 15 Feb 2002, James Bottomley wrote:

> Note also that on system shutdown, most devices that use write back caching
> are also expecting a cache flush instruction from the node, which Linux
> doesn't send.

Hair splitting: it looks as though Andre Hedrick's IDE patch did this at
least. Of course, that does not affect SCSI drives, but since you wrote
"Linux", I thought this could be mentioned in this thread again. At
least, my ATA drives are powered down some seconds before my ATX PC is,
and the kernel says about flushing the cache.

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2002-02-16 10:16:14

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On February 13, 2002 07:26 pm, James Bottomley wrote:
> A problem (that is probably only an issue for older drives) is that while
> technically the standard requires all 3 types of TAG to be supported if tag
> queueing is, some drives really only have simple tag support in their
> firmware, so you may need to add a blacklist for ordered tags on certain
> drives.

>From user space, I hope.

--
Daniel

2002-02-16 15:02:48

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] queue barrier support

On February 13, 2002 07:26 pm, James Bottomley wrote:
> A problem (that is probably only an issue for older drives) is that while
> technically the standard requires all 3 types of TAG to be supported if tag
> queueing is, some drives really only have simple tag support in their
> firmware, so you may need to add a blacklist for ordered tags on certain
> drives.

[email protected] said:
> From user space, I hope.

Well, following the current SCSI black/white list procedure, they would be in
the static device_list table in scsi_scan.c, so no.

I agree that it's not a nice or friendly thing (and is certainly prone to
delays when you have to get entries into the actual kernel code), but fixing
this particular annoyance of the scsi subsystem (although I have actually
thought about doing it) would be a project separate from the queue barrier
support.

James





2002-02-25 20:55:46

by Cindy Sweet

[permalink] [raw]
Subject: 929-Emulex, ABTS Command Anamoly!

Is this the right group to discuss about the ABTS
command ??

__________________________________________________
Do You Yahoo!?
Yahoo! Sports - Coverage of the 2002 Olympic Games
http://sports.yahoo.com

2002-02-25 21:04:17

by Arjan van de Ven

[permalink] [raw]
Subject: Re: 929-Emulex, ABTS Command Anamoly!

In article <[email protected]> you wrote:
> Is this the right group to discuss about the ABTS
> command ??

no.

emulex is a binary only driver so it should be discussed with emulex
if you have problems...