2002-02-21 23:31:27

by Chris Mason

Subject: [PATCH] 2.4.x write barriers (updated for ext3)


Hi everyone,

I've changed the write barrier code around a little so the block layer
isn't forced to fail barrier requests the queue can't handle.

This makes it much easier to add support for ide writeback
flushing to things like ext3 and lvm, where I want to make
the minimal possible changes to make things safe.

The full patch is at:
ftp.suse.com/pub/people/mason/patches/2.4.18/queue-barrier-8.diff

There might be additional spots in ext3 where ordering needs to be
enforced; I've included the ext3 code below in hopes of getting
some comments.

The only other change was to make reiserfs use the IDE flushing mode
by default. It falls back to non-ordered calls on scsi.

-chris

--- linus.23/fs/jbd/commit.c Mon, 28 Jan 2002 09:51:50 -0500
+++ linus.23(w)/fs/jbd/commit.c Thu, 21 Feb 2002 17:11:00 -0500
@@ -595,7 +595,15 @@
 struct buffer_head *bh = jh2bh(descriptor);
 clear_bit(BH_Dirty, &bh->b_state);
 bh->b_end_io = journal_end_buffer_io_sync;
+
+/* if we're on an ide device, setting BH_Ordered_Flush
+   will force a write cache flush before and after the
+   commit block.  Otherwise, it'll do nothing. */
+
+set_bit(BH_Ordered_Flush, &bh->b_state);
 submit_bh(WRITE, bh);
+clear_bit(BH_Ordered_Flush, &bh->b_state);
+
 wait_on_buffer(bh);
 put_bh(bh); /* One for getblk() */
 journal_unlock_journal_head(descriptor);

2002-02-22 14:19:54

by Stephen C. Tweedie

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Thu, Feb 21, 2002 at 06:30:20PM -0500, Chris Mason wrote:

> This makes it much easier to add support for ide writeback
> flushing to things like ext3 and lvm, where I want to make
> the minimal possible changes to make things safe.

Nice.

> There might be additional spots in ext3 where ordering needs to be
> enforced, I've included the ext3 code below in hopes of getting
> some comments.

No. However, there is another optimisation which we can make.

Most ext3 commits, in practice, are lazy, asynchronous commits, and we
only need BH_Ordered_Tag for that, not *_Flush. It would be easy
enough to track whether a given transaction has any synchronous
waiters, and if not, to use the async *_Tag request for the commit
block instead of forcing a flush.

We'd also have to track the sync status of the most recent
transaction, so that on fsync of a non-dirty file/inode we make
sure that its data has been forced to disk by at least one synchronous
flush.

But that's really only a win for SCSI, where proper async ordered tags
are supported. For IDE, the single BH_Ordered_Flush is quite
sufficient.
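
As a rough sketch of the idea (the t_sync_waiters counter is
hypothetical, not an existing jbd field; BH_Ordered_Tag and
BH_Ordered_Flush are the barrier bits from Chris' patch):

        /* lazy commit: an async ordered tag is enough; only a
           transaction with synchronous waiters needs the full
           flush barrier */
        if (atomic_read(&transaction->t_sync_waiters))
                set_bit(BH_Ordered_Flush, &bh->b_state);
        else
                set_bit(BH_Ordered_Tag, &bh->b_state);
        submit_bh(WRITE, bh);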

Cheers,
Stephen

2002-02-22 15:27:39

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Friday, February 22, 2002 02:19:15 PM +0000 "Stephen C. Tweedie" <[email protected]> wrote:

>> There might be additional spots in ext3 where ordering needs to be
>> enforced, I've included the ext3 code below in hopes of getting
>> some comments.
>
> No. However, there is another optimisation which we can make.
>
> Most ext3 commits, in practice, are lazy, asynchronous commits, and we
> only need BH_Ordered_Tag for that, not *_Flush. It would be easy
> enough to track whether a given transaction has any synchronous
> waiters, and if not, to use the async *_Tag request for the commit
> block instead of forcing a flush.

Just a note: the scsi code doesn't implement flush at all; flush
either gets ignored or failed (if BH_Ordered_Hard is set), the
assumption being that scsi devices don't write back by default, so
wait_on_buffer() is enough.
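
As a minimal sketch of that queue-side policy (the helper names are
hypothetical, not the actual patch code):

        if (test_bit(BH_Ordered_Flush, &bh->b_state)) {
                if (queue_can_flush(q))
                        queue_issue_flush(q);   /* e.g. ide cache flush */
                else if (test_bit(BH_Ordered_Hard, &bh->b_state))
                        buffer_IO_error(bh);    /* fail hard barriers */
                /* else ignore: write-through scsi plus
                   wait_on_buffer() is already safe */
        }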

The reiserfs code tries to be smart with _Tag, but in practice I haven't
found a device that gains from it, so I didn't want to make the larger
changes to ext3 until I was sure it was worthwhile ;-)

It seems the scsi drives don't do tag ordering as nicely as we'd
hoped; I'm hoping someone with a big raid controller can help
benchmark the ordered tag mode on scsi. Also, check the barrier
threads from last week on how write errors might break the
ordering with the current scsi code.

-chris

2002-02-22 15:58:21

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

> Most ext3 commits, in practice, are lazy, asynchronous commits, and we
> only need BH_Ordered_Tag for that, not *_Flush. It would be easy
> enough to track whether a given transaction has any synchronous
> waiters, and if not, to use the async *_Tag request for the commit
> block instead of forcing a flush.

> We'd also have to track the sync status of the most recent
> transaction, too, so that on fsync of a non-dirty file/inode, we make
> sure that its data had been forced to disk by at least one synchronous
> flush.

> But that's really only a win for SCSI, where proper async ordered tags
> are supported. For IDE, the single BH_Ordered_Flush is quite
> sufficient.

Unfortunately, there's a hole in the SCSI spec that means ordered
tags are extremely difficult to use in the way you want (although I
think this is an accident; conceptually, they were supposed to be used
for this). For the interested, I attach the details at the bottom.

The easy way out of the problem, I think, is to impose the barrier as an
effective queue plug in the SCSI mid-layer, so that after the mid-layer
receives the barrier, it plugs the device queue from below, drains the drive
tag queue, sends the barrier and unplugs the device queue on barrier I/O
completion.

Ordinarily, this would produce extremely poor performance since you're
effectively starving the device to implement the barrier. However, in Linux
it might just work because it will give the elevator more time to coalesce
requests.
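
As a sketch, with every function name hypothetical rather than
existing mid-layer API:

        static void scsi_barrier_plug(Scsi_Device *SDpnt, Scsi_Cmnd *barrier)
        {
                plug_device_queue(SDpnt);       /* stop feeding new commands */
                drain_drive_tag_queue(SDpnt);   /* wait out outstanding tags */
                issue_cmd_and_wait(barrier);    /* barrier goes down alone */
                unplug_device_queue(SDpnt);     /* resume on barrier completion */
        }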

James Bottomley

Problems Using Ordered Tags as a Barrier
========================================

Note, the following is independent of re-ordering on error conditions which
was discussed in a previous thread. This discussion pertains to normal device
operations.

The SCSI tag system allows all devices to have a dynamic queue. This means
that there is no a priori guarantee about how many tags the device will accept
before the queue becomes full.

The problem comes because most drivers issue SCSI commands directly from the
incoming I/O thread but complete them via the SCSI interrupt routine. What
this means is that if the SCSI device decides it has no more resources left,
the driver won't be told until it receives an interrupt indicating that the
queue is full and the particular I/O wasn't queued. At this point, the user
threads may have sent down several more I/Os, and worse still, the SCSI device
may have accepted some of the later I/Os because the local conditions causing
it to signal queue full may have abated.

As I read the standard, there's no way to avoid this problem, since the queue
full signal entitles the device not to queue the command, and not to act on
any ordering the command may have had.

The other problem is actually driver related, not SCSI. Most HBA chips are
intelligent beasts which can process independently of the host CPU.
Unfortunately, implementing linked lists tends to be rather beyond their
capabilities. For this reason, most low level drivers have a certain number
of queue slots (per device, per chip or whatever). The usual act of feeding
an I/O command to a device involves stuffing it in the first available free
slot. This can lead to command re-ordering because even though the HBA is
sequentially processing slots in a round-robin fashion, you don't often know
which slot it is currently looking at. Also, the multi threaded nature of tag
command queuing means that the slot will remain full long after the HBA has
started processing it and moved on to the next slot.

One possible get-out is to process the queue full signal internally (either in
the interrupt routine or in the chip driver itself) and force a re-send of
the non-queued tag until the drive actually accepts it, as long as this
looping is done at a level which prevents the device from accepting any more
commands. In general, this is nasty because it is effectively a busy wait
inside the HBA and will block commands to all other devices until the device
queue has drained sufficiently to accept the tag.

The other possibility would be to treat all pending commands for a particular
device as queue full errors if we get that for one of them. This would
require the interrupt or chip script routine to complete all commands for the
particular device as queue full, which would still be quite a large amount of
work for device driver writers.

Finally, I think the driver ordering problem can be solved easily as long as
an observation I have about your barrier is true. It seems to me that the
barrier is only semi-permeable, namely its purpose is to complete *after* a
particular set of commands do. This means that it doesn't matter if later
commands move through the barrier, it only matters that earlier commands
cannot move past it? If this is true, then we can fix the slot problem simply
by having a slot dedicated to barrier tags, so the processing engine goes over
it once per cycle. However, if it finds the barrier slot full, it doesn't
issue the command until the *next* cycle, thus ensuring that all commands sent
down before the barrier (plus a few after) are accepted by the device queue
before we send the barrier with its ordered tag.
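
A sketch of that processing cycle (the slot structures are purely
illustrative):

        /* normal slots are issued as found; the barrier slot is only
           issued on the cycle *after* it was first seen full, so
           everything queued before it has already reached the device */
        for (i = 0; i < NR_SLOTS; i++)
                if (slot[i].full)
                        issue_slot(&slot[i]);
        if (barrier_slot.full && barrier_slot.aged)
                issue_slot(&barrier_slot);      /* sends the ordered tag */
        barrier_slot.aged = barrier_slot.full;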



2002-02-22 16:11:29

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Friday, February 22, 2002 10:57:51 AM -0500 James Bottomley <[email protected]> wrote:

[ very interesting stuff ]

> Finally, I think the driver ordering problem can be solved easily as long as
> an observation I have about your barrier is true. It seems to me that the
> barrier is only semi-permeable, namely its purpose is to complete *after* a
> particular set of commands do.

This is my requirement for reiserfs, where I still want to wait on the
commit block to check for io errors. sct might have other plans.

> This means that it doesn't matter if later
> commands move through the barrier, it only matters that earlier commands
> cannot move past it? If this is true, then we can fix the slot problem simply
> by having a slot dedicated to barrier tags, so the processing engine goes over
> it once per cycle. However, if it finds the barrier slot full, it doesn't
> issue the command until the *next* cycle, thus ensuring that all commands sent
> down before the barrier (plus a few after) are accepted by the device queue
> before we send the barrier with its ordered tag.

Interesting, certainly sounds good.

-chris

2002-02-22 16:13:29

by Stephen C. Tweedie

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Fri, Feb 22, 2002 at 10:57:51AM -0500, James Bottomley wrote:

> Finally, I think the driver ordering problem can be solved easily as long as
> an observation I have about your barrier is true. It seems to me that the
> barrier is only semi-permeable, namely its purpose is to complete *after* a
> particular set of commands do. This means that it doesn't matter if later
> commands move through the barrier, it only matters that earlier commands
> cannot move past it?

No. A commit block must be fully ordered.

If the commit block fails to be written, then we must be able to roll
the filesystem back to the consistent, pre-commit state, which implies
that any later IOs (which might be writeback IOs updating
now-committed metadata to final locations on disk) must not be allowed
to overtake the commit block.

However, in the current code, we don't assume that ordered queuing
works, so later writeback is never scheduled until we get a
positive completion acknowledgement for the commit block. In other
words, right now, the scenario you describe is not a problem.

But ideally, with ordered queueing we would want to be able to relax
things by allowing writeback to be queued as soon as the commit is
queued. The ordered tag must be honoured in both directions in that
case.

There is a get-out for ext3 --- we can submit new journal IOs without
waiting for the commit IO to complete, but hold back on writeback IOs.
That still has the desired advantage of allowing us to stream to the
journal, but only requires that the commit block be ordered with
respect to older, not newer, IOs. That gives us most of the benefits
of tagged queuing without any problems in your scenario.
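
A sketch of that ordering, with hypothetical function names:

        /* journal writes for later transactions may overtake the
           commit; writeback of committed metadata must not */
        submit_journal_blocks(next_transaction);
        wait_for_commit_completion(transaction);  /* positive ack needed */
        submit_writeback_blocks(transaction);     /* held back until now */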

--Stephen

2002-02-22 17:36:53

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> There is a get-out for ext3 --- we can submit new journal IOs without
> waiting for the commit IO to complete, but hold back on writeback IOs.
> That still has the desired advantage of allowing us to stream to the
> journal, but only requires that the commit block be ordered with
> respect to older, not newer, IOs. That gives us most of the benefits
> of tagged queuing without any problems in your scenario.

Actually, I intended the tagged queueing discussion to be discouraging. The
amount of work that would have to be done to implement it is huge, touching,
as it does, every low level driver's interrupt routine. For the drivers that
require scripting changes to the chip engine, it's even worse: only someone
with specialised knowledge can actually make the changes.

It's feasible, but I think we'd have to demonstrate some quite significant
performance or other improvements before changes on this scale would fly.

Neither of you commented on the original suggestion. What I was wondering is
if we could benchmark (or preferably improve on) it:

[email protected] said:
> The easy way out of the problem, I think, is to impose the barrier as
> an effective queue plug in the SCSI mid-layer, so that after the
> mid-layer receives the barrier, it plugs the device queue from below,
> drains the drive tag queue, sends the barrier and unplugs the device
> queue on barrier I/O completion.

If you need strict barrier ordering, then the queue is double plugged since
the barrier has to be sent down and waited for on its own. If you allow the
discussed permeability, the queue is only single plugged since the barrier can
be sent down along with the subsequent writes.

I can take a look at implementing this in the SCSI mid-layer and you could see
what the benchmark figures look like with it in place. If it really is the
performance pig it looks like, then we could go back to the linux-scsi list
with the tag change suggestions.

James


2002-02-22 18:15:19

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Friday, February 22, 2002 12:36:22 PM -0500 James Bottomley <[email protected]> wrote:

> [email protected] said:
>> There is a get-out for ext3 --- we can submit new journal IOs without
>> waiting for the commit IO to complete, but hold back on writeback IOs.
>> That still has the desired advantage of allowing us to stream to the
>> journal, but only requires that the commit block be ordered with
>> respect to older, not newer, IOs. That gives us most of the benefits
>> of tagged queuing without any problems in your scenario.
>
> Actually, I intended the tagged queueing discussion to be discouraging.

;-)

> The
> amount of work that would have to be done to implement it is huge, touching,
> as it does, every low level driver's interrupt routine. For the drivers that
> require scripting changes to the chip engine, it's even worse: only someone
> with specialised knowledge can actually make the changes.
>
> It's feasible, but I think we'd have to demonstrate some quite significant
> performance or other improvements before changes on this scale would fly.

Very true. At best, we pick one card we know it could work on, and
one target that we know is smart about tags, and try to demonstrate
the improvement.

>
> Neither of you commented on the original suggestion. What I was wondering is
> if we could benchmark (or preferably improve on) it:
>
> [email protected] said:
>> The easy way out of the problem, I think, is to impose the barrier as
>> an effective queue plug in the SCSI mid-layer, so that after the
>> mid-layer receives the barrier, it plugs the device queue from below,
>> drains the drive tag queue, sends the barrier and unplugs the device
>> queue on barrier I/O completion.

The main way the barriers could help performance is by allowing the
drive to write all the transaction and commit blocks at once. Your
idea increases the chance the drive heads will still be correctly
positioned to write the commit block, but doesn't let the drive
stream things better.

The big advantage to using wait_on_buffer() instead is that it doesn't
order against data writes at all (from bdflush, or some other proc
other than a commit), allowing the drive to optimize those
at the same time it is writing the commit. Using ordered tags has the
same problem; it might just be that wait_on_buffer() is the best way to
go.

-chris

2002-02-25 10:59:13

by Helge Hafting

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

James Bottomley wrote:
[...]
> Unfortunately, there's a hole in the SCSI spec that means ordered
> tags are extremely difficult to use in the way you want (although I
> think this is an accident; conceptually, they were supposed to be used
> for this). For the interested, I attach the details at the bottom.
>
[...]
> The SCSI tag system allows all devices to have a dynamic queue. This means
> that there is no a priori guarantee about how many tags the device will accept
> before the queue becomes full.
>

I just wonder - isn't the number of outstanding requests a device
can handle constant? If so, the user could determine this (from the spec or
by running a utility that generates "too much" traffic).

The max number of requests may then be compiled in or added as
a kernel boot parameter. The kernel would honor this and never ever
have more outstanding requests than it believes the device
can handle.

Those who don't want to bother can use some low default or accept the
risk.

Helge Hafting

2002-02-25 15:04:53

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> I just wonder - isn't the number of outstanding requests a device can
> handle constant? If so, the user could determine this (from the spec or
> by running a utility that generates "too much" traffic).

The spec doesn't make any statements about this, so the devices are allowed to
do whatever seems best. Although it is undoubtedly implemented as a fixed
queue on a few devices, there are others whose queue depth depends on the
available resources (most disk arrays function this way---they tend to juggle
tag queue depth dynamically per lun).

Even if the queue depth is fixed, you have to probe it dynamically because it
will be different for each device. Even worse, on a SAN or other shared bus,
you might not be the only initiator using the device queue, so even for a
device with a fixed queue depth you don't own all the slots, and the queue
depth you see varies.

The bottom line is that you have to treat the queue full return as a normal
part of I/O flow control to SCSI devices.

James


2002-02-28 15:41:31

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Doug Gilbert prompted me to re-examine my notions about SCSI drive caching,
and sure enough the standard says (and all the drives I've looked at so far
come with) write back caching enabled by default.

Since this is a threat to the integrity of Journalling FS in power failure
situations now, I think it needs to be addressed with some urgency.

The "quick fix" would obviously be to get the sd driver to do a mode select at
probe time to turn off the WCE and RCD bits (this will place the cache into
write through mode), which would match the assumptions all the JFSs currently
make. I'll see if I can code up a quick patch to do this.
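
For reference, the bits in question live in the caching mode page
(page code 0x08): in byte 2, WCE is bit 2 and RCD is bit 0. A hedged
sketch, with the MODE SENSE/MODE SELECT plumbing omitted:

        unsigned char *page = caching_mode_page(buffer); /* hypothetical */
        page[2] &= ~0x04;       /* WCE = 0: cache becomes write through */
        page[2] &= ~0x01;       /* RCD = 0: read caching stays enabled */
        /* ...send the modified page back with MODE SELECT... */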

A longer term solution might be to keep the writeback cache but send down a
SYNCHRONIZE CACHE command as part of the back end completion of a barrier
write, so the fs wouldn't get a completion until the write was done and all
the dirty cache blocks flushed to the medium.
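
A sketch of that command: SYNCHRONIZE CACHE(10) is opcode 0x35, and a
CDB with the LBA and block count left at zero flushes the whole cache:

        unsigned char cdb[10];
        memset(cdb, 0, sizeof(cdb));
        cdb[0] = 0x35;  /* SYNCHRONIZE CACHE: completes when the dirty
                           blocks have reached the medium */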

Clearly, there would also have to be a mechanism to flush the cache on
unmount, so if this were done by ioctl, would you prefer that the filesystem
be in charge of flushing the cache on barrier writes, or would you like the sd
device to do it transparently?

James


2002-02-28 16:00:53

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Thursday, February 28, 2002 09:36:52 AM -0600 James Bottomley <[email protected]> wrote:

> Doug Gilbert prompted me to re-examine my notions about SCSI drive caching,
> and sure enough the standard says (and all the drives I've looked at so far
> come with) write back caching enabled by default.

Really. Has it always been this way?

>
> Since this is a threat to the integrity of Journalling FS in power failure
> situations now, I think it needs to be addressed with some urgency.
>
> The "quick fix" would obviously be to get the sd driver to do a mode select at
> probe time to turn off the WCE and RCD bits (this will place the cache into
> write through mode), which would match the assumptions all the JFSs currently
> make. I'll see if I can code up a quick patch to do this.

Ok.

>
> A longer term solution might be to keep the writeback cache but send down a
> SYNCHRONIZE CACHE command as part of the back end completion of a barrier
> write, so the fs wouldn't get a completion until the write was done and all
> the dirty cache blocks flushed to the medium.

Right, they could just implement ORDERED_FLUSH in the barrier patch.

>
> Clearly, there would also have to be a mechanism to flush the cache on
> unmount, so if this were done by ioctl, would you prefer that the filesystem
> be in charge of flushing the cache on barrier writes, or would you like the sd
> device to do it transparently?

How about triggered by closing the block device? That would also cover
people like oracle that do stuff to the raw device.

-chris

2002-02-28 18:04:38

by Mike Anderson

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Chris Mason [[email protected]] wrote:
>
> ..snip..
> >
> > Clearly, there would also have to be a mechanism to flush the cache on
> > unmount, so if this were done by ioctl, would you prefer that the filesystem
> > be in charge of flushing the cache on barrier writes, or would you like the sd
> > device to do it transparently?
>
> How about triggered by closing the block device. That would also cover
> people like oracle that do stuff to the raw device.
>
> -chris

Doing something in sd_release should cover the raw case:
raw_release->blkdev_put->bdev->bd_op->release ("sd_release").

At least from what I understand of the raw release call path :-).
-Mike
--
Michael Anderson
[email protected]

2002-02-28 18:19:06

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Thursday, February 28, 2002 10:55:46 AM -0500 Chris Mason <[email protected]> wrote:

>>
>> A longer term solution might be to keep the writeback cache but send down a
>> SYNCHRONIZE CACHE command as part of the back end completion of a barrier
>> write, so the fs wouldn't get a completion until the write was done and all
>> the dirty cache blocks flushed to the medium.
>
> Right, they could just implement ORDERED_FLUSH in the barrier patch.

So, a little testing with scsi_info shows my scsi drives do have
writeback cache on. great. What's interesting is they
must be doing additional work for ordered tags. If they were treating
the block as written once in cache, using the tags should not change
performance at all. But, I can clearly show the tags changing
performance, and hear the drive write pattern change when tags are on.

-chris

2002-03-01 02:15:04

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> So, a little testing with scsi_info shows my scsi drives do have
> writeback cache on. great. What's interesting is they must be doing
> additional work for ordered tags. If they were treating the block as
> written once in cache, using the tags should not change performance
> at all. But, I can clearly show the tags changing performance, and
> hear the drive write pattern change when tags are on.

I checked all mine and they're write through. However, I inherited all my
drives from an enterprise vendor so this might not be that surprising.

I can surmise why ordered tags kill performance on your drive: since an
ordered tag is required to affect the ordering of the write to the medium, not
the cache, it is probably implemented with an implicit cache flush.

Anyway, the attached patch against 2.4.18 (and I know it's rather gross code)
will probe the cache type and try to set it to write through on boot. See
what this does to your performance ordinarily, and also to your tagged write
barrier performance.

James



Attachments:
sd-cache.diff (3.88 kB)

2002-03-01 15:26:53

by Dieter Nützel

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

James Bottomley wrote:
> [email protected] said:
> > So, a little testing with scsi_info shows my scsi drives do have
> > writeback cache on. great. What's interesting is they must be doing
> > additional work for ordered tags. If they were treating the block as
> > written once in cache, using the tags should not change performance
> > at all. But, I can clearly show the tags changing performance, and
> > hear the drive write pattern change when tags are on.

> I checked all mine and they're write through. However, I inherited all my
> drives from an enterprise vendor so this might not be that surprising.

How did you check it?
Which scsi_info version?
Mine gave only the info below:

SunWave1 /home/nuetzel# scsi_info /dev/sda
SCSI_ID="0,0,0"
MODEL="IBM DDYS-T18350N"
FW_REV="S96H"
SunWave1 /home/nuetzel# scsi_info /dev/sdb
SCSI_ID="0,1,0"
MODEL="IBM DDRS-34560D"
FW_REV="DC1B"
SunWave1 /home/nuetzel# scsi_info /dev/sdc
SCSI_ID="0,2,0"
MODEL="IBM DDRS-34560W"
FW_REV="S71D"

But when I use "scsi-config" I get under "Cache Control Page":
Read cache enabled: Yes
Write cache enabled: No

I tested setting this by hand some months ago, but the speed
didn't change in any way (ReiserFS).

> I can surmise why ordered tags kill performance on your drive: since an
> ordered tag is required to affect the ordering of the write to the medium,
> not the cache, it is probably implemented with an implicit cache flush.
>
> Anyway, the attached patch against 2.4.18 (and I know it's rather gross
> code) will probe the cache type and try to set it to write through on boot.
> See what this does to your performance ordinarily, and also to your
> tagged write barrier performance.

Will test it over the weekend on 2.4.19-pre1aa1 with all Reiserfs
2.4.18.pending patches applied.

Regards,
Dieter

2002-03-01 16:00:44

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> How did you check it?

I used sginfo from Doug Gilbert's sg utilities (http://www.torque.net/sg)

The version was sg3_utils-0.98

[email protected] said:
> But when I use "scsi-config" I get under "Cache Control Page": Read
> cache enabled: Yes Write cache enabled: No

I believe write cache enabled is the state of the WCE bit and read cache
enabled is the inverse of the RCD bit, so you have a write through cache.

I think that notwithstanding the spec, most drives are write through (purely
because of the safety aspect). I suspect certain manufacturers use write back
caching to try to improve performance figures (at the expense of safety).

James


2002-03-03 22:16:22

by Daniel Phillips

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On February 28, 2002 04:36 pm, James Bottomley wrote:
> Doug Gilbert prompted me to re-examine my notions about SCSI drive caching,
> and sure enough the standard says (and all the drives I've looked at so far
> come with) write back caching enabled by default.
>
> Since this is a threat to the integrity of Journalling FS in power failure
> situations now, I think it needs to be addressed with some urgency.
>
> The "quick fix" would obviously be to get the sd driver to do a mode select at
> probe time to turn off the WCE and RCD bits (this will place the cache into
> write through mode), which would match the assumptions all the JFSs currently
> make. I'll see if I can code up a quick patch to do this.
>
> A longer term solution might be to keep the writeback cache but send down a
> SYNCHRONIZE CACHE command as part of the back end completion of a barrier
> write, so the fs wouldn't get a completion until the write was done and all
> the dirty cache blocks flushed to the medium.

I've been following the thread; I hope I haven't missed anything fundamental.
A better long-term solution is to have ordered tags work as designed. It's
not broken by design, is it, just in implementation?

I have a standing offer from at least one engineer to make firmware changes
to the drives if it makes Linux work better. So a reasonable plan is: first
know what's ideal, second ask for it. Coupled with that, we'd need a way of
identifying drives that don't work in the ideal way, and require a fallback.

In my opinion, the only correct behavior is a write barrier that completes
when data is on the platter, and that does this even when write-back is
enabled. Surely this is not rocket science at the disk firmware level. Is
this or is this not the way ordered tags were supposed to work?

> Clearly, there would also have to be a mechanism to flush the cache on
> unmount, so if this were done by ioctl, would you prefer that the filesystem
> be in charge of flushing the cache on barrier writes, or would you like the sd
> device to do it transparently?

The filesystem should just say 'this request is a write barrier' and the
lower layers, whether that's scsi or bio, should do what's necessary to make
it come true.

--
Daniel

2002-03-04 03:35:03

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Sunday, March 03, 2002 11:11:44 PM +0100 Daniel Phillips <[email protected]> wrote:

> I have a standing offer from at least one engineer to make firmware changes
> to the drives if it makes Linux work better. So a reasonable plan is: first
> know what's ideal, second ask for it. Coupled with that, we'd need a way of
> identifying drives that don't work in the ideal way, and require a fallback.
>
> In my opinion, the only correct behavior is a write barrier that completes
> when data is on the platter, and that does this even when write-back is
> enabled.

With a battery backup, we want the raid controller (or whatever) to
pretend the barrier is done right away. It should be just as safe, and
it allows the target to merge the writes.

> Surely this is not rocket science at the disk firmware level. Is
> this or is this not the way ordered tags were supposed to work?

There are many issues at play in this thread; here's an attempt at
a summary (please correct any mistakes).

1) The drivers would need to be changed to properly keep tag ordering
in place on resets, and error conditions.

2) ordered tags force ordering of all writes the drive is processing.
For some workloads, it will be forced to order stuff the journal code
doesn't care about at all, perhaps leading to lower performance than
the simple wait_on_buffer() we're using now.

2a) Are the filesystems asking for something impossible? Can drives
really write block N and N+1, making sure to commit N to media before
N+1 (including an abort on N+1 if N fails), but still keeping up a
nice seek free stream of writes?

3) Some drives may not be very smart about ordered tags. We need
to figure out which is faster, using the ordered tag or using a
simple cache flush (when writeback is on). The good news about
the cache flush is that it doesn't require major surgery in the
scsi error handlers.

4) If some scsi drives come with writeback on by default, do they also
turn it on under high load like IDE drives do?

>
>> Clearly, there would also have to be a mechanism to flush the cache on
>> unmount, so if this were done by ioctl, would you prefer that the filesystem
>> be in charge of flushing the cache on barrier writes, or would you like the sd
>> device to do it transparently?
>
> The filesystem should just say 'this request is a write barrier' and the
> lower layers, whether that's scsi or bio, should do what's necessary to make
> it come true.

That's the goal. The current 2.4 patch differentiates between ordered
barriers and flush barriers just so I can make the flush the default
on IDE, and enable the ordered stuff when I want to experiment on scsi.

-chris

2002-03-04 04:22:53

by Jeremy Higdon

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Mar 3, 11:11pm, Daniel Phillips wrote:
> I have a standing offer from at least one engineer to make firmware changes
> to the drives if it makes Linux work better. So a reasonable plan is: first
> know what's ideal, second ask for it. Coupled with that, we'd need a way of
> identifying drives that don't work in the ideal way, and require a fallback.
>
> In my opinion, the only correct behavior is a write barrier that completes
> when data is on the platter, and that does this even when write-back is
> enabled. Surely this is not rocket science at the disk firmware level. Is
> this or is this not the way ordered tags were supposed to work?


Ordered tags just specify ordering in the command stream. The WCE bit
specifies when the write command is complete. I have never heard of
any implied requirement to flush to media when a drive receives an
ordered tag and WCE is set. It does seem like a useful feature to have
in the standard, but I don't think it's there.

So if one vendor implements those semantics, but the others don't where
does that leave us?

jeremy

2002-03-04 05:10:02

by Daniel Phillips

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 04:34 am, Chris Mason wrote:
> On Sunday, March 03, 2002 11:11:44 PM +0100 Daniel Phillips <[email protected]> wrote:
>
> > I have a standing offer from at least one engineer to make firmware changes
> > to the drives if it makes Linux work better. So a reasonable plan is: first
> > know what's ideal, second ask for it. Coupled with that, we'd need a way of
> > identifying drives that don't work in the ideal way, and require a fallback.
> >
> > In my opinion, the only correct behavior is a write barrier that completes
> > when data is on the platter, and that does this even when write-back is
> > enabled.
>
> With a battery backup, we want the raid controller (or whatever) to
> pretend the barrier is done right away. It should be as safe, and
> allow the target to merge the writes.

Agreed, that should count as 'on the platter'. Unless the battery is flat...

> > Surely this is not rocket science at the disk firmware level. Is
> > this or is this not the way ordered tags were supposed to work?
>
> There are many issues at play in this thread, here's an attempt at
> a summary (please correct any mistakes).
>
> 1) The drivers would need to be changed to properly keep tag ordering
> in place on resets, and error conditions.

Linux drivers? Isn't that a simple matter of coding? ;-)

> 2) ordered tags force ordering of all writes the drive is processing.
> For some workloads, it will be forced to order stuff the journal code
> doesn't care about at all, perhaps leading to lower performance than
> the simple wait_on_buffer() we're using now.

OK, thanks for the clear definition of the problem. This corresponds
to my reading of this document:

http://www.storage.ibm.com/hardsoft/products/ess/pubs/f2ascsi1.pdf

Ordered Queue Tag:

The command begins execution after all previously issued commands
complete. Subsequent commands may not begin execution until this
command completes (unless they are issued with Head of Queue tag
messages).

But chances are, almost all the IOs ahead of the journal commit belong
to your same filesystem anyway, so you may be worrying too much about
possibly waiting for something on another partition.

In theory, bio could notice the barrier coming down the pipe and hold
back commands on other partitions, if they're too far away physically.

> 2a) Are the filesystems asking for something impossible? Can drives
> really write block N and N+1, making sure to commit N to media before
> N+1 (including an abort on N+1 if N fails), but still keeping up a
> nice seek free stream of writes?
>
> 3) Some drives may not be very smart about ordered tags. We need
> to figure out which is faster, using the ordered tag or using a
> simple cache flush (when writeback is on). The good news about
> the cache flush is that it doesn't require major surgery in the
> scsi error handlers.

Everything else seems to be getting major surgery these days, so...

> 4) If some scsi drives come with writeback on by default, do they also
> turn it on under high load like IDE drives do?

It shouldn't matter, if the ordered queue tag is implemented properly.
From the thread I gather it isn't always, which means we need a
blacklist, or putting on a happier face, a whitelist.

> >> Clearly, there would also have to be a mechanism to flush the cache on
> >> unmount, so if this were done by ioctl, would you prefer that the filesystem
> >> be in charge of flushing the cache on barrier writes, or would you like the sd
> >> device to do it transparently?
> >
> > The filesystem should just say 'this request is a write barrier' and the
> > lower layers, whether that's scsi or bio, should do what's necessary to make
> > it come true.
>
> That's the goal. The current 2.4 patch differentiates between ordered
> barriers and flush barriers just so I can make the flush the default
> on IDE, and enable the ordered stuff when I want to experiment on scsi.

I should state it more precisely: 'this request is a write barrier for this
partition'. Is that what you had in mind?

--
Daniel

2002-03-04 05:36:15

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> On Mar 3, 11:11pm, Daniel Phillips wrote:
> > I have a standing offer from at least one engineer to make firmware changes
> > to the drives if it makes Linux work better. So a reasonable plan is: first
> > know what's ideal, second ask for it. Coupled with that, we'd need a way of
> > identifying drives that don't work in the ideal way, and require a fallback.
> >
> > In my opinion, the only correct behavior is a write barrier that completes
> > when data is on the platter, and that does this even when write-back is
> > enabled. Surely this is not rocket science at the disk firmware level. Is
> > this or is this not the way ordered tags were supposed to work?
>
> Ordered tags just specify ordering in the command stream. The WCE bit
> specifies when the write command is complete.

WCE is per-command? And 0 means no caching, so the command must complete
when the data is on the media?

> I have never heard of
> any implied requirement to flush to media when a drive receives an
> ordered tag and WCE is set. It does seem like a useful feature to have
> in the standard, but I don't think it's there.

It seems to be pretty strongly implied that things should work that way.
What is the use of being sure the write with the ordered tag is on media
if you're not sure about the writes that were supposed to
precede it? Spelling this out would indeed be helpful.

> So if one vendor implements those semantics, but the others don't where
> does that leave us?

It leaves us with a vendor we want to buy our drives from, if we want our
data to be safe.

--
Daniel

2002-03-04 06:11:20

by Jeremy Higdon

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Mar 4, 6:31am, Daniel Phillips wrote:
> On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> > On Mar 3, 11:11pm, Daniel Phillips wrote:
> > > I have a standing offer from at least one engineer to make firmware changes
> > > to the drives if it makes Linux work better. So a reasonable plan is: first
> > > know what's ideal, second ask for it. Coupled with that, we'd need a way of
> > > identifying drives that don't work in the ideal way, and require a fallback.
> > >
> > > In my opinion, the only correct behavior is a write barrier that completes
> > > when data is on the platter, and that does this even when write-back is
> > > enabled. Surely this is not rocket science at the disk firmware level. Is
> > > this or is this not the way ordered tags were supposed to work?
> >
> > Ordered tags just specify ordering in the command stream. The WCE bit
> > specifies when the write command is complete.
>
> WCE is per-command? And 0 means no caching, so the command must complete
> when the data is on the media?

My reading is that WCE==1 means that the command is complete when the
data is in the drive buffer.

> > I have never heard of
> > any implied requirement to flush to media when a drive receives an
> > ordered tag and WCE is set. It does seem like a useful feature to have
> > in the standard, but I don't think it's there.
>
> It seems to be pretty strongly implied that things should work that way.
> What is the use of being sure the write with the ordered tag is on media
> if you're not sure about the writes that were supposed to
> precede it? Spelling this out would indeed be helpful.

WCE==1 and ordered tag means that the data for previous commands is in
the drive buffer before the data for the ordered tag is in the drive
buffer.

> > So if one vendor implements those semantics, but the others don't where
> > does that leave us?
>
> It leaves us with a vendor we want to buy our drives from, if we want our
> data to be safe.

The point is, do you write code that depends on one vendor's interpretation?
If so, then the vendor needs to be identified. Perhaps other vendors will
then align themselves.

> Daniel

jeremy

2002-03-04 08:01:56

by Daniel Phillips

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 07:09 am, Jeremy Higdon wrote:
> On Mar 4, 6:31am, Daniel Phillips wrote:
> > On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> > > I have never heard of
> > > any implied requirement to flush to media when a drive receives an
> > > ordered tag and WCE is set. It does seem like a useful feature to have
> > > in the standard, but I don't think it's there.
> >
> > It seems to be pretty strongly implied that things should work that way.
> > What is the use of being sure the write with the ordered tag is on media
> > if you're not sure about the writes that were supposed to
> > precede it? Spelling this out would indeed be helpful.
>
> WCE==1 and ordered tag means that the data for previous commands is in
> the drive buffer before the data for the ordered tag is in the drive
> buffer.

Right, and what we're talking about is going further and requiring that WCE=0
and ordered tag means the data for previous commands is *not* in the buffer,
i.e., on the platter, which is the only interpretation that makes sense.

> > > So if one vendor implements those semantics, but the others don't where
> > > does that leave us?
> >
> > It leaves us with a vendor we want to buy our drives from, if we want our
> > data to be safe.
>
> The point is, do you write code that depends on one vendor's interpretation?

Yes, that's the idea. And we need some way of knowing which vendors have
interpreted the scsi spec in the way that maximizes both throughput and
safety. That's the 'whitelist'.

> If so, then the vendor needs to be identified. Perhaps other vendors will
> then align themselves.

I'm sure they will.

--
Daniel

2002-03-04 08:21:09

by Helge Hafting

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Sun, Mar 03, 2002 at 10:34:07PM -0500, Chris Mason wrote:
[...]
> 3) Some drives may not be very smart about ordered tags. We need
> to figure out which is faster, using the ordered tag or using a
> simple cache flush (when writeback is on). The good news about
> the cache flush is that it doesn't require major surgery in the
> scsi error handlers.

Isn't that a userspace thing? I.e. use ordered tags in the best
way possible for drives that _are_ smart about ordered tags.
Let the admin change that with a hdparm-like utility
if testing (or specs) confirms that this particular
drive takes a performance hit.

I think the days of putting up with stupid hw are
slowly going away. Linux is a serious server OS these
days, and disk makers will be smart about ordered tags
if some server OSes benefit from it. It won't
really cost them much either.

Old hw is another story of course - some sort of
fallback might be useful for that. But probably
not for next year's drives. :-)

Helge Hafting

2002-03-04 14:48:40

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> I've been following the thread, I hope I haven't missed anything
> fundamental. A better long term solution is to have ordered tags work
> as designed. It's not broken by design is it, just implementation?

There is actually one hole in the design: A scsi device may accept a command
with an ordered tag, disconnect and at a later time reconnect and return a
QUEUE FULL status indicating that the tag must be retried. In the time
between the disconnect and reconnect, the standard doesn't require that no
other tags be accepted, so if the local flow control conditions abate, the
device is allowed to accept and execute a tag sent down in between the
disconnect and reconnect.

I think this would introduce a very minor deviation where one tag could
overtake another, but we may still get a usable implementation even with this.

James


2002-03-04 14:58:20

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> 1) The drivers would need to be changed to properly keep tag ordering
> in place on resets, and error conditions.

And actually there's also a problem with normal operations that disrupt flow
control, like QUEUE FULL returns and contingent allegiance conditions.

Basically, neither the SCSI mid-layer nor the low level drivers were designed
to keep absolute command ordering. They take the chaotic I/O approach: you
give me a bunch of commands and I tell you when they complete.

> 2) ordered tags force ordering of all writes the drive is processing.
> For some workloads, it will be forced to order stuff the journal code
> doesn't care about at all, perhaps leading to lower performance than
> the simple wait_on_buffer() we're using now.

> 2a) Are the filesystems asking for something impossible? Can drives
> really write block N and N+1, making sure to commit N to media before
> N+1 (including an abort on N+1 if N fails), but still keeping up a
> nice seek free stream of writes?

These are the "big" issues. There's not much point doing all the work to
implement ordered tags, if the end result is going to be no gain in
performance.

> 4) If some scsi drives come with writeback on by default, do they also
> turn it on under high load like IDE drives do?

Finally, an easy one...the answer's "no". The cache control bits are the only
way to alter caching behaviour (nothing stops a WCE=1 drive operating as write
through if it wants to, but a WCE=0 drive cannot operate write back).

James


2002-03-04 15:04:00

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> But chances are, almost all the IOs ahead of the journal commit belong
> to your same filesystem anyway, so you may be worrying too much about
> possibly waiting for something on another partition.

My impression is that most modern JFS can work on multiple transactions
simultaneously. All you really care about, I believe, is I/O ordering within
the transaction. However, separate transactions have no I/O ordering
requirements with respect to each other (unless they actually overlap). Using
ordered tags imposes a global ordering, not just a local transaction ordering,
so they may not be the most appropriate way to ensure the ordering of writes
within a single transaction.

I'm not really a JFS expert, so perhaps those who actually develop these
filesystems could comment?

James


2002-03-04 16:53:14

by Stephen C. Tweedie

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Sun, Mar 03, 2002 at 10:09:35PM -0800, Jeremy Higdon wrote:

> > WCE is per-command? And 0 means no caching, so the command must complete
> > when the data is on the media?
>
> My reading is that WCE==1 means that the command is complete when the
> data is in the drive buffer.

Even if WCE is enabled in the caching mode page, we can still set FUA
(Force Unit Access) in individual write commands to force platter
completion before commands complete.

Of course, it's a good question whether this is honoured properly on
all drives.

FUA is not available on WRITE(6), only on WRITE(10) or WRITE(12) commands.
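
A sketch of where the bit lives: in a WRITE(10) CDB (opcode 0x2A),
FUA is bit 3 of byte 1:

        unsigned char cdb[10];
        memset(cdb, 0, sizeof(cdb));
        cdb[0] = 0x2a;  /* WRITE(10) */
        cdb[1] |= 0x08; /* FUA: this write bypasses the write-back cache */
        /* bytes 2-5: LBA (big endian), bytes 7-8: transfer length */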

--Stephen

2002-03-04 17:05:15

by Stephen C. Tweedie

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Mon, Mar 04, 2002 at 09:03:31AM -0600, James Bottomley wrote:
> [email protected] said:
> > But chances are, almost all the IOs ahead of the journal commit belong
> > to your same filesystem anyway, so you may be worrying too much about
> > possibly waiting for something on another partition.
>
> My impression is that most modern JFS can work on multiple transactions
> simultaneously. All you really care about, I believe, is I/O ordering within
> the transaction. However, separate transactions have no I/O ordering
> requirements with respect to each other (unless they actually overlap).

Generally, that may be true but it's irrelevant. Internally, the fs
may keep transactions as independent, but as soon as IO is scheduled,
those transactions become serialised. Given that pure sequential IO
is so much more efficient than random IO, we usually expect
performance to be improved, not degraded, by such serialisation.

I don't know of any filesystems which will be able to recover a
transaction X+1 if transaction X is not complete in the log. Once you
start writing, the transactions lose their independence.

> Using
> ordered tags imposes a global ordering, not just a local transaction ordering,

Actually, ordered tags are in many cases not global enough. LVM, for
example.

Basically, as far as journal writes are concerned, you just want
things sequential for performance, so serialisation isn't a problem
(and it typically happens anyway). After the journal write, the
eventual proper writeback of the dirty data to disk has no internal
ordering requirement at all --- it just needs to start strictly after
the commit, and end before the journal records get reused. Beyond
that, the write order for the writeback data is irrelevant.

Cheers,
Stephen

2002-03-04 17:17:46

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Monday, March 04, 2002 05:04:34 PM +0000 "Stephen C. Tweedie" <[email protected]> wrote:

> Basically, as far as journal writes are concerned, you just want
> things sequential for performance, so serialisation isn't a problem
> (and it typically happens anyway). After the journal write, the
> eventual proper writeback of the dirty data to disk has no internal
> ordering requirement at all --- it just needs to start strictly after
> the commit, and end before the journal records get reused. Beyond
> that, the write order for the writeback data is irrelevant.
>

writeback data order is important, mostly because of where the data blocks
are in relation to the log. If you've got bdflush unloading data blocks
to the disk, and another process doing a commit, the drive's queue
might look like this:

data1, data2, data3, commit1, data4, data5 etc.

If commit1 is an ordered tag, the drive is required to flush
data1, data2 and data3, then write the commit, then seek back
for data4 and data5.

If commit1 is not an ordered tag, the drive can write all the
data blocks, then seek back to get the commit.

-chris

2002-03-04 17:25:36

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Monday, March 04, 2002 08:57:57 AM -0600 James Bottomley <[email protected]> wrote:


>> 2) ordered tags force ordering of all writes the drive is processing.
>> For some workloads, it will be forced to order stuff the journal code
>> doesn't care about at all, perhaps leading to lower performance than
>> the simple wait_on_buffer() we're using now.
>
>> 2a) Are the filesystems asking for something impossible? Can drives
>> really write block N and N+1, making sure to commit N to media before
>> N+1 (including an abort on N+1 if N fails), but still keeping up a
>> nice seek free stream of writes?
>
> These are the "big" issues. There's not much point doing all the work to
> implement ordered tags, if the end result is going to be no gain in
> performance.

Right, 2a seems to be the show stopper to me. The good news is
the existing patches are enough to benchmark the thing and see if
any devices actually benefit. If we find enough that do, then it
might be worth the extra driver coding required to make the code
correct.

>
>> 4) If some scsi drives come with writeback on by default, do they also
>> turn it on under high load like IDE drives do?
>
> Finally, an easy one...the answer's "no". The cache control bits are the only
> way to alter caching behaviour (nothing stops a WCE=1 operating as write
> through if the drive wants to, but a WCE=0 cannot operate write back).

good to hear, thanks.

-chris

2002-03-04 17:37:54

by James Bottomley

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> Generally, that may be true but it's irrelevant. Internally, the fs
> may keep transactions as independent, but as soon as IO is scheduled,
> those transactions become serialised. Given that pure sequential IO
> is so much more efficient than random IO, we usually expect
> performance to be improved, not degraded, by such serialisation.

This is the part I'm struggling with. Even without error handling and certain
other changes that would have to be made to give guaranteed integrity to the
tag ordering, Chris' patch is a very reasonable experimental model of how an
optimal system for implementing write barriers via ordered tags would work;
yet when he benchmarks, he sees a performance decrease.

I can dismiss his results as being due to firmware problems with his drives
making them behave non-optimally for ordered tags, but I really would like to
see evidence that someone somewhere actually sees a performance boost with
Chris' patch.

Have there been any published comparisons of a write barrier implementation
versus something like the McKusick soft update idea, or even just
multi-threaded back end completion of the transactions?

James


2002-03-04 17:49:59

by Chris Mason

Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Monday, March 04, 2002 11:35:24 AM -0600 James Bottomley <[email protected]> wrote:

> [email protected] said:
>> Generally, that may be true but it's irrelevant. Internally, the fs
>> may keep transactions as independent, but as soon as IO is scheduled,
>> those transactions become serialised. Given that pure sequential IO
>> is so much more efficient than random IO, we usually expect
>> performance to be improved, not degraded, by such serialisation.
>
> This is the part I'm struggling with. Even without error handling and certain
> other changes that would have to be made to give guaranteed integrity to the
> tag ordering, Chris' patch is a very reasonable experimental model of how an
> optimal system for implementing write barriers via ordered tags would work;
> yet when he benchmarks, he sees a performance decrease.
>

Actually most tests I've done show no change at all. So far, only
lots of O_SYNC writes stress the log enough to show a performance
difference, about 10% faster with tags on.

> I can dismiss his results as being due to firmware problems with his drives
> making them behave non-optimally for ordered tags, but I really would like to
> see evidence that someone somewhere actually sees a performance boost with
> Chris' patch.

So would I ;-)

>
> Have there been any published comparisons of a write barrier implementation
> versus something like the McKusick soft update idea, or even just
> multi-threaded back end completion of the transactions?

Sorry, what do you mean by multi-threaded back end completion of the
transaction?

-chris

2002-03-04 18:06:24

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Mon, Mar 04, 2002 at 12:16:35PM -0500, Chris Mason wrote:

> writeback data order is important, mostly because of where the data blocks
> are in relation to the log. If you've got bdflush unloading data blocks
> to the disk, and another process doing a commit, the drive's queue
> might look like this:
>
> data1, data2, data3, commit1, data4, data5 etc.
>
> If commit1 is an ordered tag, the drive is required to flush
> data1, data2 and data3, then write the commit, then seek back
> for data4 and data5.

Yes, but that's a performance issue, not a correctness one.

Also, as soon as we have journals on external devices, this whole
thing changes entirely. We still have to enforce the commit ordering
in the journal, but we also still need the ordering between that
commit and any subsequent writeback, and that obviously can no longer
be achieved via ordered tags if the two writes are happening on
different devices.

--Stephen

2002-03-04 18:10:24

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Mon, Mar 04, 2002 at 11:35:24AM -0600, James Bottomley wrote:

> Have there been any published comparisons of a write barrier implementation
> versus something like the McKusick soft update idea

Soft updates are just another mechanism of doing ordered writes. If
the disk IO subsystem is lying about write ordering or is doing
unexpected writeback caching, soft updates are no more of a cure than
journaling.

> or even just
> multi-threaded back end completion of the transactions?

ext3 already completes the on-disk transaction asynchronously
within a separate kjournald thread, independent of writeback IO going
on in the VM's own writeback threads. Since it is kernel code
with full access to the kernel's internal lazy IO completion
mechanisms, I'm not sure that it can usefully be given any more
threading. I think the reiserfs situation is similar.

--Stephen

2002-03-04 18:12:04

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> Sorry, what do you mean by multi-threaded back end completion of the
> transaction?

It's an old idea from databases with fine grained row level locking. To alter
data in a single row, you reserve space in the rollback log, take the row
lock, write the transaction description, write the data, undo the transaction
description and release the rollback log space and row lock. These actions
are sequential, but there may be many such transactions going on in the table
simultaneously. The way I've seen a database do this is to set up the actions
as linked threads which are run as part of the completion routine of the
previous thread. Thus, you don't need to wait for the update to complete, you
just kick off the transaction. You are prevented from stepping on your own
transaction because if you want to alter the same row again you have to wait
for the row lock to be released. The row locks are the "barriers" in this
case, but they preserve the concept of transaction independence. Of course,
most DB transactions involve many row locks and you don't even want to get
into what the deadlock detection algorithms look like...

I always imagined a journalled filesystem worked something like this, since
most of the readers/writers will be acting independently there shouldn't be so
much deadlock potential.
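
In code, the chaining might look something like the sketch below -- purely
illustrative userspace C with made-up names, where the row lock plays the
part of the barrier and each step runs as the completion routine of the
previous one:

#include <stdio.h>
#include <pthread.h>

struct txn {
	pthread_mutex_t *row_lock;	/* the per-row "barrier" */
	void (*next)(struct txn *);	/* completion routine to chain to */
};

static void txn_release(struct txn *t)
{
	/* final step: undo the txn description, free rollback log space */
	printf("release rollback space, drop row lock\n");
	pthread_mutex_unlock(t->row_lock);
}

static void txn_write_data(struct txn *t)
{
	printf("write the row data\n");
	t->next = txn_release;
	t->next(t);			/* in reality, run from IO completion */
}

static void txn_start(struct txn *t)
{
	pthread_mutex_lock(t->row_lock);  /* waits if we'd step on our own txn */
	printf("reserve rollback space, write txn description\n");
	t->next = txn_write_data;
	t->next(t);
}

int main(void)
{
	pthread_mutex_t row_lock = PTHREAD_MUTEX_INITIALIZER;
	struct txn t = { &row_lock, NULL };

	txn_start(&t);			/* caller doesn't wait for completion */
	return 0;
}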

James


2002-03-04 18:20:04

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 05:52 pm, Stephen C. Tweedie wrote:
> Hi,
>
> On Sun, Mar 03, 2002 at 10:09:35PM -0800, Jeremy Higdon wrote:
>
> > > WCE is per-command? And 0 means no caching, so the command must complete
> > > when the data is on the media?
> >
> > My reading is that WCE==1 means that the command is complete when the
> > data is in the drive buffer.
>
> Even if WCE is enabled in the caching mode page, we can still set FUA
> (Force Unit Access) in individual write commands to force platter
> completion before commands complete.

Yes, I discovered the FUA bit just after making the previous post, so please
substitute 'FUA' for 'WCE' in the above.

> Of course, it's a good question whether this is honoured properly on
> all drives.
>
> FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.

I'm having a little trouble seeing the difference between WRITE10, WRITE12
and WRITE16. WRITE6 seems to be different only in not guaranteeing to
support the FUA (and one other) bit. I'm reading the SCSI Block Commands
2 pdf:

ftp://ftp.t10.org/t10/drafts/sbc2/sbc2r05a.pdf

(Side note: how nice it would be if t10.org got a clue and posted their
docs in html, in addition to the inconvenient, unhyperlinked, proprietary
format pdfs.)

--
Daniel

2002-03-04 18:28:45

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> Also, as soon as we have journals on external devices, this whole
> thing changes entirely. We still have to enforce the commit ordering
> in the journal, but we also still need the ordering between that
> commit and any subsequent writeback, and that obviously can no longer
> be achieved via ordered tags if the two writes are happening on
> different devices.

Yes, that's a killer: ordered tags aren't going to be able to enforce cross
device write barriers.

There is one remaining curiosity I have, at least about the benchmarks. Since
the linux elevator and tag queueing perform an essentially similar function
(except that the disk itself has a better notion of ordering because it knows
its own geometry), might we get better performance by reducing the number of
tags we allow the device to use, thus forcing the writes to remain longer in
the linux elevator?

James


2002-03-04 18:42:17

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)



On Monday, March 04, 2002 12:11:27 PM -0600 James Bottomley <[email protected]> wrote:

> [email protected] said:
>> Sorry, what do you mean by multi-threaded back end completion of the
>> transaction?
>
> It's an old idea from databases with fine grained row level locking. To alter
> data in a single row, you reserve space in the rollback log, take the row
> lock, write the transaction description, write the data, undo the transaction
> description and release the rollback log space and row lock. These actions
> are sequential, but there may be many such transactions going on in the table
> simultaneously. The way I've seen a database do this is to set up the actions
> as linked threads which are run as part of the completion routine of the
> previous thread. Thus, you don't need to wait for the update to complete, you
> just kick off the transaction.

Ok, then, like sct said, we try really hard to have external threads
do log io for us. It also helps that an atomic unit usually isn't
as small as 'mkdir p'. Many operations get batched together to
reduce log overhead.

-chris

2002-03-04 19:07:10

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 06:24 pm, Chris Mason wrote:
> On Monday, March 04, 2002 08:57:57 AM -0600 James Bottomley wrote:
> >> 2a) Are the filesystems asking for something impossible? Can drives
> >> really write block N and N+1, making sure to commit N to media before
> >> N+1 (including an abort on N+1 if N fails), but still keeping up a
> >> nice seek free stream of writes?
> >
> > These are the "big" issues. There's not much point doing all the work to
> > implement ordered tags, if the end result is going to be no gain in
> > performance.
>
> Right, 2a seems to be the show stopper to me. The good news is
> the existing patches are enough to benchmark the thing and see if
> any devices actually benefit. If we find enough that do, then it
> might be worth the extra driver coding required to make the code
> correct.

Waiting with breathless anticipation. And once these issues are worked out,
there's a tough one remaining: enforcing the write barrier through a virtual
volume with multiple spindles underneath, each with its own command queue, so
that the write barrier applies to all of them.

--
Daniel

2002-03-04 19:52:40

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 07:05 pm, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Mar 04, 2002 at 12:16:35PM -0500, Chris Mason wrote:
>
> > writeback data order is important, mostly because of where the data blocks
> > are in relation to the log. If you've got bdflush unloading data blocks
> > to the disk, and another process doing a commit, the drive's queue
> > might look like this:
> >
> > data1, data2, data3, commit1, data4, data5 etc.
> >
> > If commit1 is an ordered tag, the drive is required to flush
> > data1, data2 and data3, then write the commit, then seek back
> > for data4 and data5.
>
> Yes, but that's a performance issue, not a correctness one.
>
> Also, as soon as we have journals on external devices, this whole
> thing changes entirely. We still have to enforce the commit ordering
> in the journal, but we also still need the ordering between that
> commit and any subsequent writeback, and that obviously can no longer
> be achieved via ordered tags if the two writes are happening on
> different devices.

But the bio layer can manage it, by sending a write barrier down all relevant
queues. We can send a zero length write barrier command, yes?

--
Daniel

2002-03-04 19:55:40

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 06:16 pm, Chris Mason wrote:
> On Monday, March 04, 2002 05:04:34 PM +0000 "Stephen C. Tweedie" <[email protected]> wrote:
>
> > Basically, as far as journal writes are concerned, you just want
> > things sequential for performance, so serialisation isn't a problem
> > (and it typically happens anyway). After the journal write, the
> > eventual proper writeback of the dirty data to disk has no internal
> > ordering requirement at all --- it just needs to start strictly after
> > the commit, and end before the journal records get reused. Beyond
> > that, the write order for the writeback data is irrelevant.
>
> writeback data order is important, mostly because of where the data blocks
> are in relation to the log. If you've got bdflush unloading data blocks
> to the disk, and another process doing a commit, the drive's queue
> might look like this:
>
> data1, data2, data3, commit1, data4, data5 etc.
>
> If commit1 is an ordered tag, the drive is required to flush
> data1, data2 and data3, then write the commit, then seek back
> for data4 and data5.
>
> If commit1 is not an ordered tag, the drive can write all the
> data blocks, then seek back to get the commit.

We can have more than one queue per device, I think. Then we can have reads
unaffected by write barriers, for example. It never makes sense for the
write barrier to wait on a read.

--
Daniel

2002-03-04 19:56:30

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Mon, Mar 04, 2002 at 12:28:19PM -0600, James Bottomley wrote:

> There is one remaining curiosity I have, at least about the benchmarks. Since
> the linux elevator and tag queueing perform an essentially similar function
> (except that the disk itself has a better notion of ordering because it knows
> its own geometry), might we get better performance by reducing the number of
> tags we allow the device to use, thus forcing the writes to remain longer in
> the linux elevator?

Possibly, but my gut feeling says no and so do any benchmarks I've
seen regarding queue depths on adaptec controllers (not that I've seen
many). For relatively-closeby IOs, the disk will always have a better
idea of how to optimise a number of IOs than the Linux elevator can
have, especially if we have multiple IOs spanning multiple heads
within a single cylinder.

--Stephen

2002-03-04 19:58:20

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Mon, Mar 04, 2002 at 08:48:02PM +0100, Daniel Phillips wrote:
> On March 4, 2002 07:05 pm, Stephen C. Tweedie wrote:
> > Also, as soon as we have journals on external devices, this whole
> > thing changes entirely. We still have to enforce the commit ordering
> > in the journal, but we also still need the ordering between that
> > commit and any subsequent writeback, and that obviously can no longer
> > be achieved via ordered tags if the two writes are happening on
> > different devices.
>
> But the bio layer can manage it, by sending a write barrier down all relevant
> queues. We can send a zero length write barrier command, yes?

Sort of --- there are various flush commands we can use. However, bio
can't just submit the barriers, it needs to synchronise them, and that
means doing a global wait over all the devices until they have all
acked their barrier op. That's expensive: you may be as well off just
using the current fs-internal synchronous commands at that point.
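
For what it's worth, the global wait would look roughly like this -- a
minimal sketch in which submit_barrier_op() is hypothetical (the completion
and atomic calls are stock kernel primitives, the rest is made up):

#include <linux/completion.h>
#include <linux/kdev_t.h>
#include <asm/atomic.h>

/* hypothetical: queue a flush/barrier op, call done(data) when acked */
extern void submit_barrier_op(kdev_t dev, void (*done)(void *), void *data);

struct barrier_sync {
	atomic_t pending;		/* devices still to ack */
	struct completion all_acked;
};

static void barrier_done(void *data)	/* per-device end_io callback */
{
	struct barrier_sync *bs = data;

	if (atomic_dec_and_test(&bs->pending))
		complete(&bs->all_acked);
}

static void sync_all_devices(kdev_t *devs, int ndevs)
{
	struct barrier_sync bs;
	int i;

	atomic_set(&bs.pending, ndevs);
	init_completion(&bs.all_acked);

	for (i = 0; i < ndevs; i++)
		submit_barrier_op(devs[i], barrier_done, &bs);

	/* the expensive part: a global wait across every queue */
	wait_for_completion(&bs.all_acked);
}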

--Stephen

2002-03-04 21:11:00

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 08:57 pm, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Mar 04, 2002 at 08:48:02PM +0100, Daniel Phillips wrote:
> > On March 4, 2002 07:05 pm, Stephen C. Tweedie wrote:
> > > Also, as soon as we have journals on external devices, this whole
> > > thing changes entirely. We still have to enforce the commit ordering
> > > in the journal, but we also still need the ordering between that
> > > commit and any subsequent writeback, and that obviously can no longer
> > > be achieved via ordered tags if the two writes are happening on
> > > different devices.
> >
> > But the bio layer can manage it, by sending a write barrier down all
> > relevant queues. We can send a zero length write barrier command, yes?
>
> Sort of --- there are various flush commands we can use. However, bio
> can't just submit the barriers, it needs to synchronise them, and that
> means doing a global wait over all the devices until they have all
> acked their barrier op. That's expensive: you may be as well off just
> using the current fs-internal synchronous commands at that point.

With ordered tags, at least we get the benefit of not having to wait on all
the commands before the write barrier.

It's annoying to have to let all the command queues empty, but it's hard
to see what can be done about that; the synchronization *has* to be global.
In this case, all we can do is to be sure to respond quickly to the command
completion interrupt. So the unavoidable cost is one request's worth of bus
transfer (is there an advantage in trying to make it a small request?) and
the latency of the interrupt. 100 uSec?

In the meantime, if I am right about being able to have multiple queues per
disk, reads can continue. It's not so bad.

The only way I can imagine of improving this is if there's a way to queue
some commands on the understanding they're not to be carried out until the
word is given. My scsi-fu is not great enough to know if there's a way to do
this. Even if we could, it's probably not worth the effort, because all the
drives will have to wait for the slowest/most loaded anyway.

That's life.

--
Daniel

2002-03-04 21:35:28

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Mon, Mar 04, 2002 at 12:11:27PM -0600, James Bottomley wrote:

> The way I've seen a database do this is to set up the actions
> as linked threads which are run as part of the completion routine of the
> previous thread. Thus, you don't need to wait for the update to complete, you
> just kick off the transaction. You are prevented from stepping on your own
> transaction because if you want to alter the same row again you have to wait
> for the row lock to be released. The row locks are the "barriers" in this
> case, but they preserve the concept of transaction independence.

Right, but in the database world we are usually doing synchronous
transactions, so allowing the writeback to be done in parallel is
important; and typically there's a combination of undo and redo
logging, so there is a much more relaxed ordering requirement on the
main data writes.

In filesystems it's much more common just to use redo logging, so we
can't do any file writes before the journal commit; and the IO is
usually done as writeback after the application's syscall has
finished.

Linux already has such fine-grained locking for the actual completion
of the filesystem operations, and in the journaling case,
coarse-grained writeback is usually done because it's far more
efficient to be able to batch up a bunch of updates into a single
transaction in the redo log.

There are some exceptions. GFS, for example, takes care to maintain
transactional fine grainedness even for writeback, because in a
distributed filesystem you have to be able to release pinned metadata
back to another node on demand as quickly as possible, and you don't
want to force huge compound transactions out to disk when doing so.

Cheers,
Stephen

2002-03-05 07:11:45

by Jeremy Higdon

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Mar 4, 8:57am, Daniel Phillips wrote:
>
> On March 4, 2002 07:09 am, Jeremy Higdon wrote:
> > On Mar 4, 6:31am, Daniel Phillips wrote:
> > > On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> > > > I have never heard of
> > > > any implied requirement to flush to media when a drive receives an
> > > > ordered tag and WCE is set. It does seem like a useful feature to have
> > > > in the standard, but I don't think it's there.
> > >
> > > It seems to be pretty strongly implied that things should work that way.
> > > What is the use of being sure the write with the ordered tag is on media
> > > if you're not sure about the writes that were supposed to
> > > precede it? Spelling this out would indeed be helpful.
> >
> > WCE==1 and ordered tag means that the data for previous commands is in
> > the drive buffer before the data for the ordered tag is in the drive
> > buffer.
>
> Right, and what we're talking about is going further and requiring that WCE=0
> and ordered tag means the data for previous commands is *not* in the buffer,
> i.e., on the platter, which is the only interpretation that makes sense.


Sorry to be slow here, but if WCE=0, then commands aren't complete until
data is on the media, so since previous commands don't complete until
data is on the media, and they must complete before the ordered tag
command does, what you say would have to be the case. I thought the idea
was to buffer commands to drive memory (so that the drive could increase
performance by writing back to back commands without losing a rev) and
then issue a command with a "flush" side effect.

Here is an interesting question. If you use WCE=1 and then send an
ordered tag with FUA=1, does that imply that data from previous
write commands is flushed to media? I don't think so, though it
would be a useful feature if it did.

jeremy

2002-03-05 07:25:02

by Jeremy Higdon

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Mar 4, 8:57am, James Bottomley wrote:
>
> > 2a) Are the filesystems asking for something impossible? Can drives
> > really write block N and N+1, making sure to commit N to media before
> > N+1 (including an abort on N+1 if N fails), but still keeping up a
> > nice seek free stream of writes?
>
> These are the "big" issues. There's not much point doing all the work to
> implement ordered tags, if the end result is going to be no gain in
> performance.


If a drive does reduced latency writes, then blocks can be written out
of order. Also, for a trivial case: with hardware RAIDs, when the
data for a single command is split across multiple drives, you can get
data blocks written out of order, no matter what you do.

I don't think a filesystem can make any assumptions about blocks within
a single command, though with ordered tags (assuming driver and device
support) and no write caching, it can make assumptions between commands.

jeremy

2002-03-05 07:41:26

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Mon, Mar 04 2002, Daniel Phillips wrote:
> > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
>
> I'm having a little trouble seeing the difference between WRITE10, WRITE12
> and WRITE16. WRITE6 seems to be different only in not guaranteeing to
> support the FUA (and one other) bit. I'm reading the SCSI Block Commands

WRITE6 was deprecated because there is only one byte available to set
transfer size. Enter WRITE10. WRITE12 allows the use of the streaming
performance settings, that's the only functional difference wrt WRITE10
iirc.

> (Side note: how nice it would be if t10.org got a clue and posted their
> docs in html, in addition to the inconvenient, unhyperlinked, proprietary
> format pdfs.)

See the mtfuji docs as an example of how nicely pdf's can be set up too.
The thought of substituting that for a html version makes me want to
barf.

--
Jens Axboe

2002-03-05 07:42:56

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Mon, Mar 04 2002, Daniel Phillips wrote:
> > writeback data order is important, mostly because of where the data blocks
> > are in relation to the log. If you've got bdflush unloading data blocks
> > to the disk, and another process doing a commit, the drive's queue
> > might look like this:
> >
> > data1, data2, data3, commit1, data4, data5 etc.
> >
> > If commit1 is an ordered tag, the drive is required to flush
> > data1, data2 and data3, then write the commit, then seek back
> > for data4 and data5.
> >
> > If commit1 is not an ordered tag, the drive can write all the
> > data blocks, then seek back to get the commit.
>
> We can have more than one queue per device, I think. Then we can have reads
> unaffected by write barriers, for example. It never makes sense for the
> write barrier to wait on a read.

No, there will always be at most one queue for a device. There might be
more than one device on a queue, though, so yes the implementation at
the block/queue level still leaves something to be desired.

--
Jens Axboe

2002-03-05 07:49:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Mon, Mar 04 2002, Daniel Phillips wrote:
> But the bio layer can manage it, by sending a write barrier down all relevant
> queues. We can send a zero length write barrier command, yes?

Actually, yes, that was indeed one of the things I wanted to achieve with
the block layer rewrite -- the ability to send commands other than
read/write down the queue. So not exactly bio, but more of a new block
feature.

See, now fs requests have REQ_CMD set in the request flag bits. This
means that it's a "regular" request, which has a string of bios attached
to it. Doing something ala

struct request *rq = get_request();	/* allocate a request; these names
					   are illustrative, not the exact
					   block layer calls */

init_request(rq);
rq->rq_dev = target_dev;		/* device that gets the command */
rq->cmd[0] = GPCMD_FLUSH_CACHE;		/* raw command block, no bios */
rq->flags = REQ_PC;			/* packet command, not REQ_CMD */
/* additional info... */
queue_request(rq);			/* sent through the normal queue */

would indeed be possible. The attentive reader will now know where
ide-scsi is headed and why :-)

This would work for any SCSI and pseudo-SCSI device, basically all the
stuff out there. For IDE, the request pre-handler would transform this
into an IDE command (or taskfile).

--
Jens Axboe

2002-03-05 14:59:37

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Mon, Mar 04, 2002 at 10:06:19PM +0100, Daniel Phillips wrote:
> On March 4, 2002 08:57 pm, Stephen C. Tweedie wrote:
> > On Mon, Mar 04, 2002 at 08:48:02PM +0100, Daniel Phillips wrote:
> > > On March 4, 2002 07:05 pm, Stephen C. Tweedie wrote:
> > > > Also, as soon as we have journals on external devices, this whole
> > > > thing changes entirely.

> > > We can send a zero length write barrier command, yes?
> >
> > Sort of --- there are various flush commands we can use. However, bio
> > can't just submit the barriers, it needs to synchronise them, and that
> > means doing a global wait over all the devices until they have all
> > acked their barrier op. That's expensive: you may be as well off just
> > using the current fs-internal synchronous commands at that point.
>
> With ordered tags, at least we get the benefit of not having to wait on all
> the commands before the write barrier.
>
> It's annoying to have to let all the command queues empty, but it's hard
> to see what can be done about that; the synchronization *has* to be global.
> In this case, all we can do is to be sure to respond quickly to the command
> completion interrupt. So the unavoidable cost is one request's worth of bus
> transfer (is there an advantage in trying to make it a small request?) and
> the latency of the interrupt. 100 uSec?

It probably doesn't really matter. For performance, we want to stream
both the journal writes and the primary disk writeback as much as
possible, but a bit of latency in the synchronisation between the two
ought to be largely irrelevant.

Much more significant than the external-journal case is probably the
stripe case, either with raid5, striped LVM or raid-1+0. In that
case, even sequential IO to the notionally-sequential journal may have
to be split over multiple disks, and at that point the pipeline stall
in the middle of IO that was supposed to be sequential will really
hurt.

Cheers,
Stephen

2002-03-05 22:35:05

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 5, 2002 08:40 am, Jens Axboe wrote:
> On Mon, Mar 04 2002, Daniel Phillips wrote:
> > > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> >
> > I'm having a little trouble seeing the difference between WRITE10, WRITE12
> > and WRITE16. WRITE6 seems to be different only in not guaranteeing to
> > support the FUA (and one other) bit. I'm reading the SCSI Block Commands
>
> WRITE6 was deprecated because there is only one byte available to set
> transfer size. Enter WRITE10. WRITE12 allows the use of the streaming
> performance settings, that's the only functional difference wrt WRITE10
> iirc.

Thanks. This is poorly documented, to say the least.

> > (Side note: how nice it would be if t10.org got a clue and posted their
> > docs in html, in addition to the inconvenient, unhyperlinked, proprietary
> > format pdfs.)
>
> See the mtfuji docs as an example of how nicely pdf's can be set up too.

Do you have a url?

> The thought of substituting that for a html version makes me want to
> barf.

Who said substitute? Provide beside, as is reasonable. For my part,
pdf's tend to cause severe indigestion, if not actually cause
regurgitation.

--
Daniel

2002-03-05 23:01:44

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 5, 2002 08:09 am, Jeremy Higdon wrote:
> On Mar 4, 8:57am, Daniel Phillips wrote:
> >
> > On March 4, 2002 07:09 am, Jeremy Higdon wrote:
> > > On Mar 4, 6:31am, Daniel Phillips wrote:
> > > > On March 4, 2002 05:21 am, Jeremy Higdon wrote:
> > > > > I have never heard of
> > > > > any implied requirement to flush to media when a drive receives an
> > > > > ordered tag and WCE is set. It does seem like a useful feature to have
> > > > > in the standard, but I don't think it's there.
> > > >
> > > > It seems to be pretty strongly implied that things should work that way.
> > > > What is the use of being sure the write with the ordered tag is on media
> > > > if you're not sure about the writes that were supposed to
> > > > precede it? Spelling this out would indeed be helpful.
> > >
> > > WCE==1 and ordered tag means that the data for previous commands is in
> > > the drive buffer before the data for the ordered tag is in the drive
> > > buffer.
> >
> > Right, and what we're talking about is going further and requiring that WCE=0
> > and ordered tag means the data for previous commands is *not* in the buffer,
> > i.e., on the platter, which is the only interpretation that makes sense.
>
> Sorry to be slow here, but if WCE=0, then commands aren't complete until
> data is on the media,

Sorry, I meant FUA, not WCE. For this error I offer the apology that there
is a whole new set of TLA's to learn here, and I started yesterday.

> so since previous commands don't complete until
> data is on the media, and they must complete before the ordered tag
> command does, what you say would have to be the case. I thought the idea
> was to buffer commands to drive memory (so that the drive could increase
> performance by writing back to back commands without losing a rev) and
> then issue a command with a "flush" side effect.
>
> Here is an interesting question. If you use WCE=1 and then send an
> ordered tag with FUA=1, does that imply that data from previous
> write commands is flushed to media? I don't think so, though it
> would be a useful feature if it did.

That's my point all right. And what I tried to say is, it's useless to
have it otherwise, so we should now start beating up drive makers to do it
this way (I don't think they'll need a lot of convincing actually) and we
should write a test procedure to determine which drives do it correctly,
according to our definition of correctness. If we agree on what is
correct of course.

--
Daniel

2002-03-05 23:06:44

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 5, 2002 08:22 am, Jeremy Higdon wrote:
> On Mar 4, 8:57am, James Bottomley wrote:
> >
> > > 2a) Are the filesystems asking for something impossible? Can drives
> > > really write block N and N+1, making sure to commit N to media before
> > > N+1 (including an abort on N+1 if N fails), but still keeping up a
> > > nice seek free stream of writes?
> >
> > These are the "big" issues. There's not much point doing all the work to
> > implement ordered tags, if the end result is going to be no gain in
> > performance.
>
> If a drive does reduced latency writes, then blocks can be written out
> of order. Also, for a trivial case: with hardware RAIDs, when the
> data for a single command is split across multiple drives, you can get
> data blocks written out of order, no matter what you do.

That's ok, the journal takes care of this. And hence the need to be so
careful about how the journal commit is handled.

> I don't think a filesystem can make any assumptions about blocks within
> a single command, though with ordered tags (assuming driver and device
> support) and no write caching, it can make assumptions between commands.

We're trying to get rid of the 'no write caching' requirement.

--
Daniel

2002-03-06 14:04:34

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On March 4, 2002 03:48 pm, James Bottomley wrote:
> [email protected] said:
> > I've been following the thread, I hope I haven't missed anything
> > fundamental. A better long term solution is to have ordered tags work
> > as designed. It's not broken by design is it, just implementation?
>
> There is actually one hole in the design: A scsi device may accept a command
> with an ordered tag, disconnect and at a later time reconnect and return a
> QUEUE FULL status indicating that the tag must be retried. In the time
> between the disconnect and reconnect, the standard doesn't require that no
> other tags be accepted, so if the local flow control conditions abate, the
> device is allowed to accept and execute a tag sent down in between the
> disconnect and reconnect.

How can a drive accept a command while it is disconnected from the bus?
Did you mean that after it reconnects it might refuse the ordered tag and
accept another? That would be a bug, I'd think.

> I think this would introduce a very minor deviation where one tag could
> overtake another, but we may still get a useable implementation even with this.

It would mean we would have to wait for completion of the tagged command
before submitting any more commands. Not nice, but not horribly costly
either.

--
Daniel

2002-03-06 14:35:10

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

[email protected] said:
> How can a drive can accept a command while it is disconnected from the
> bus. Did you mean that after it reconnects it might refuse the ordered
> tag and accept another? That would be a bug, I'd think.

Disconnect is SCSI slang for releasing the bus to other uses; it doesn't imply
electrical disconnection from it. The architecture of SCSI is like this; the
usual (and simplified) operation of a single command is:

- Initiator selects device and sends command and tag information.
- device disconnects
....
- device reselects initiator, presents tag and demands to transfer data (in
the direction dictated by the command).
- device may disconnect and reselect as many times as it wishes during data
transfer as dictated by its flow control (at least one block of data must
transfer for each reselection)
- device disconnects to complete operation
...
- device reselects and presents tag and status (command is now complete)

A tag is like a temporary ticket for identifying the command in progress.

During the (...) phases, the bus is free and the initiator is able to send
down new commands with different tags. If the device isn't going to be able
to accept the command, it is allowed to skip the data transfer phase and go
straight to status and present a QUEUE FULL status return. However, there is
still a disconnected period where the initiator doesn't know the command won't
be accepted and may send down other tagged commands.

> It would mean we would have to wait for completion of the tagged
> command before submitting any more commands. Not nice, but not
> horribly costly either.

But if we must await completion of ordered tags just to close this hole, it
makes the most sense to do it in the bio layer (or the journal layer, where
the wait is currently being done anyway) since it is generic to every low
level driver.

James


2002-03-10 05:42:40

by Douglas Gilbert

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Sun, Mar 03, 2002 at 10:09:35PM -0800, Jeremy Higdon wrote:
>
> > > WCE is per-command? And 0 means no caching, so the command must complete
> > > when the data is on the media?
> >
> > My reading is that WCE==1 means that the command is complete when the
> > data is in the drive buffer.
>
> Even if WCE is enabled in the caching mode page, we can still set FUA
> (Force Unit Access) in individual write commands to force platter
> completion before commands complete.
>
> Of course, it's a good question whether this is honoured properly on
> all drives.
>
> FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.

Stephen,
FUA is also available on WRITE16. The same FUA support pattern
applies to the READ6,10,12 and 16 series. Interestingly if a
WRITE10 is called with FUA==0 followed by a READ10 with FUA=1
on the same block(s) then the READ causes the a flush from the
cache to the platter (if it hasn't already been done). [It
would be pretty ugly otherwise :-)]

Also SYNCHRONIZE CACHE(10) allows a range of blocks to be sent
to the platter but the size of the range is limited to 2**16 - 1
blocks which is probably too small to be useful. If the
"number of blocks" field is set to 0 then the whole disk cache
is flushed to the platter. There is a SYNCHRONIZE CACHE(16)
defined in recent sbc2 drafts that allows a 32 bit range
but it is unlikely to appear on any disk any time soon. There
is also an "Immed"-iate bit on these sync_cache commands
that may be of interest. When set, this bit instructs the
target to respond with a good status immediately on receipt
of the command (and thus before the dirty blocks of the disk
cache are flushed to the platter).
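
For concreteness, a SYNCHRONIZE CACHE(10) CDB packs as sketched below
(build_sync_cache_10() is just an illustration of the layout described
above; check the sbc draft before relying on it):

#include <stdint.h>
#include <string.h>

static void build_sync_cache_10(uint8_t cdb[10], uint32_t lba,
				uint16_t nblocks, int immed)
{
	memset(cdb, 0, 10);
	cdb[0] = 0x35;			/* SYNCHRONIZE CACHE(10) opcode */
	if (immed)
		cdb[1] |= 0x02;		/* Immed: good status before flush ends */
	cdb[2] = lba >> 24;		/* starting LBA, big-endian */
	cdb[3] = lba >> 16;
	cdb[4] = lba >> 8;
	cdb[5] = lba;
	cdb[7] = nblocks >> 8;		/* number of blocks, max 2**16 - 1 */
	cdb[8] = nblocks;		/* 0 means "flush the whole cache" */
}

Most of the time we'd issue it with lba == 0 and nblocks == 0 to flush
everything.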

Doug Gilbert

2002-03-11 11:13:24

by Kurt Garloff

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi Doug,

On Sun, Mar 10, 2002 at 12:24:12AM -0500, Douglas Gilbert wrote:
> "Stephen C. Tweedie" wrote:
> > Even if WCE is enabled in the caching mode page, we can still set FUA
> > (Force Unit Access) in individual write commands to force platter
> > completion before commands complete.
> >
> > Of course, it's a good question whether this is honoured properly on
> > all drives.
> >
> > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
>
> Stephen,
[...]
>
> Also SYNCHRONIZE CACHE(10) allows a range of blocks to be sent
> to the platter but the size of the range is limited to 2**16 - 1
> blocks which is probably too small to be useful. If the
> "number of blocks" field is set to 0 then the whole disk cache
> is flushed to the platter.

Which I think we should send before shutdown (and possible poweroff) for
disks (DASDs), Write-Once and Optical Memory devices. (Funny enough, the
SCSI spec also lists SYNCHRONIZE_CACHE for CD-ROM devices.)
Unfortunately, SYNCHRONIZE CACHE is optional, so we would need to ignore any
errors returned by this command.

Regards,
--
Kurt Garloff <[email protected]> Eindhoven, NL
GPG key: See mail header, key servers Linux kernel development
SuSE Linux AG, Nuernberg, DE SCSI, Security



2002-03-11 11:36:04

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Hi,

On Sun, Mar 10, 2002 at 12:24:12AM -0500, Douglas Gilbert wrote:

> > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
>
> Stephen,
> FUA is also available on WRITE16.

I said WRITE6, not WRITE16. :-) WRITE6 uses the low 5 bits of the LUN
byte for the top bits of the block number; WRITE10 and later use those
5 bits for DPO/FUA etc. But WRITE6 is a horribly limited interface:
you only have 21 bits of block number for a start, so it's limited to
1GB on 512-byte-sector devices. We can probably ignore WRITE6 safely
enough.
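
To make the byte-1 difference concrete, here is how a WRITE10 CDB with
FUA lays out -- a sketch of the layout from the sbc draft cited earlier,
not code from any driver:

#include <stdint.h>
#include <string.h>

static void build_write_10(uint8_t cdb[10], uint32_t lba,
			   uint16_t nblocks, int fua)
{
	memset(cdb, 0, 10);
	cdb[0] = 0x2a;			/* WRITE(10) opcode */
	if (fua)
		cdb[1] |= 0x08;		/* FUA bit: complete only when on media
					   (in WRITE6, byte 1's low 5 bits held
					   the top of the block number instead) */
	cdb[2] = lba >> 24;		/* full 32-bit LBA, big-endian */
	cdb[3] = lba >> 16;
	cdb[4] = lba >> 8;
	cdb[5] = lba;
	cdb[7] = nblocks >> 8;		/* 16-bit transfer length */
	cdb[8] = nblocks;
}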

--Stephen

2002-03-11 17:17:32

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

This patch (against 2.4.18) addresses our synchronisation problems with write
back caches only, not the ordering problem with tags.

It probes the cache type on attach and inserts synchronisation instructions on
release() (i.e. unmount) or if the reboot notifier is called.

How would you like the cache synchronize instruction plugged into the journal
writes? I can do it either by exposing an ioctl which the journal code can
use, or I can try to use the write barrier (however, the bio layer is going to
have to ensure the ordering if I do that).
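
(For readers without the attachment: the reboot-notifier half presumably
has roughly the shape below. sd_synchronize_cache() is a stand-in name,
not necessarily what the patch calls it.)

#include <linux/notifier.h>
#include <linux/reboot.h>

/* hypothetical helper: issue SYNCHRONIZE CACHE to each attached disk */
extern void sd_synchronize_cache(void);

static int sd_notify_reboot(struct notifier_block *nb,
			    unsigned long event, void *p)
{
	if (event == SYS_DOWN || event == SYS_HALT || event == SYS_POWER_OFF)
		sd_synchronize_cache();
	return NOTIFY_DONE;
}

static struct notifier_block sd_notifier = {
	notifier_call:	sd_notify_reboot,
};

/* registered at attach time, after probing the caching mode page:
 *	register_reboot_notifier(&sd_notifier);
 */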

James


Attachments:
sd-cache-2.4.18.diff (7.26 kB)

2002-03-12 01:17:32

by Masanori Goto

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

At Mon, 11 Mar 2002 12:13:00 +0100,
Kurt Garloff <[email protected]> wrote:
> On Sun, Mar 10, 2002 at 12:24:12AM -0500, Douglas Gilbert wrote:
> > "Stephen C. Tweedie" wrote:
> > > Even if WCE is enabled in the caching mode page, we can still set FUA
> > > (Force Unit Access) in individual write commands to force platter
> > > completion before commands complete.
> > >
> > > Of course, it's a good question whether this is honoured properly on
> > > all drives.
> > >
> > > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> >
> > Stephen,
> [...]
> >
> > Also SYNCHRONIZE CACHE(10) allows a range of blocks to be sent
> > to the platter but the size of the range is limited to 2**16 - 1
> > blocks which is probably too small to be useful. If the
> > "number of blocks" field is set to 0 then the whole disk cache
> > is flushed to the platter.
>
> Which I think we should send before shutdown (and possible poweroff) for
> disks (DASDs), Write-Once and Optical Memory devices. (Funny enough, the
> SCSI spec also lists SYNCHRONIZE_CACHE for CD-ROM devices.)
> Unfortunately, SYNCHRONIZE CACHE is optional, so we would need to ignore any
> errors returned by this command.

I agree.
BTW, does power management like suspend/resume also need
SYNCHRONIZE_CACHE for broken HDDs/controllers...?

-- gotom

2002-03-12 06:58:50

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Mon, Mar 11 2002, Kurt Garloff wrote:
> disks (DASDs), Write-Once and Optical Memory devices. (Funny enough, the
> SCSI spec also lists SYNCHRONIZE_CACHE for CD-ROM devices.)

Hey, I use SYNCHRONIZE_CACHE in the packet writing stuff for CD-ROM's
all the time :-). Not all are read-only. In fact, Peter Osterlund
discovered that if you have pending writes on the CD-ROM it's a really
good idea to sync the cache prior to starting reads or they have a nasty
tendency to time out.

--
Jens Axboe

2002-03-12 07:01:51

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

On Tue, Mar 05 2002, Daniel Phillips wrote:
> > On Mon, Mar 04 2002, Daniel Phillips wrote:
> > > > FUA is not available on WRITE6, only WRITE10 or WRITE12 commands.
> > >
> > > I'm having a little trouble seeing the difference between WRITE10, WRITE12
> > > and WRITE16. WRITE6 seems to be different only in not guaranteeing to
> > > support the FUA (and one other) bit. I'm reading the SCSI Block Commands
> >
> > WRITE6 was deprecated because there is only one byte available to set
> > transfer size. Enter WRITE10. WRITE12 allows the use of the streaming
> > performance settings, that's the only functional difference wrt WRITE10
> > iirc.
>
> Thanks. This is poorly documented, to say the least.

Maybe in the t10 spec; it's quite nicely explained elsewhere. Try the
Mtfuji spec; it really is better organized and easier to browse through.

> > > (Side note: how nice it would be if t10.org got a clue and posted their
> > > docs in html, in addition to the inconvenient, unhyperlinked, proprietary
> > > format pdfs.)
> >
> > See the mtfuji docs as an example of how nicely pdf's can be set up too.
>
> Do you have a url?

ftp.avc-pioneer.com/Mtfuji5/Spec

> > The thought of substituting that for a html version makes me want to
> > barf.
>
> Who said substitute? Provide beside, as is reasonable. For my part,
> pdf's tend to cause severe indigestion, if not actually cause
> regurgitation.

Matter of taste I guess; I find html slow and cumbersome to use.

--
Jens Axboe

2002-03-13 22:37:55

by Peter Osterlund

[permalink] [raw]
Subject: Re: [PATCH] 2.4.x write barriers (updated for ext3)

Jens Axboe <[email protected]> writes:

> On Mon, Mar 11 2002, Kurt Garloff wrote:
> > disks (DASDs), Write-Once and Optical Memory devices. (Funny enough, the
> > SCSI spec also lists SYNCHRONIZE_CACHE for CD-ROM devices.)
>
> Hey, I use SYNCHRONIZE_CACHE in the packet writing stuff for CD-ROM's
> all the time :-). Not all are read-only. In fact, Peter Osterlund
> discovered that if you have pending writes on the CD-ROM it's a really
> good idea to sync the cache prior to starting reads or they have a nasty
> tendency to time out.

Not only time out; some drives give up immediately with SK/ASC/ASCQ
05/2c/00 "command sequence error" unless you flush the cache first.
After some googling, I found a plausible explanation for that
behaviour here:

http://www.rahul.net/endl/cdaccess/t10/email/mmc/1997/m9703031.txt

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340