md_flush_mddev just passes on the sector relative to the raid device,
shouldn't it be translated somewhere?
On Friday September 3, [email protected] wrote:
> md_flush_mddev just passes on the sector relative to the raid device,
> shouldn't it be translated somewhere?
Yes. md_flush_mddev should simply be removed.
The functionality should be, and largely is, in the individual
personalities.
Is there documentation somewhere on exactly what an issue_flush_fn
should do (is it allowed to sleep? what must happen before it is
allowed to return, what is the "error_sector" for, that sort of thing).
I suspect that at least raid5 will need some fairly special handling.
NeilBrown
On Sat, Sep 04 2004, Neil Brown wrote:
> On Friday September 3, [email protected] wrote:
> > md_flush_mddev just passes on the sector relative to the raid device,
> > shouldn't it be translated somewhere?
>
> Yes. md_flush_mddev should simply be removed.
> The functionality should be, and largely is, in the individual
> personalities.
Yes, sorry I was a little lazy there even though I followed the plugging
conversion :(
> Is there documentation somewhere on exactly what an issue_flush_fn
> should do (is it allowed to sleep? what must happen before it is
> allowed to return, what is the "error_sector" for, that sort of thing).
It is allowed to sleep, you should return when the flush is complete.
error_sector is the failed location, which really should be a dev,sector
tuple.
--
Jens Axboe
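
For reference, the contract described above maps onto a hook that, in the 2.6
kernels of this period, looks roughly like the sketch below. This is a hedged
illustration, not the real ide-disk or sd handler: my_sync_cache() and the
mydrv_* names are hypothetical; only the general shape of the issue_flush_fn
prototype and the blk_queue_issue_flush_fn() registration follow the block
layer of the time.

#include <linux/blkdev.h>

/*
 * Minimal sketch of an issue_flush_fn: it may sleep, it returns only once
 * the flush is complete, and it fills in *error_sector on failure.
 * my_sync_cache() is a hypothetical helper that sends the device's
 * cache-flush command and blocks until the hardware reports completion.
 */
static int mydrv_issue_flush(request_queue_t *q, struct gendisk *disk,
			     sector_t *error_sector)
{
	struct mydrv_device *dev = q->queuedata;
	sector_t failed;
	int ret;

	ret = my_sync_cache(dev, &failed);	/* sleeping is fine here */
	if (ret && error_sector)
		*error_sector = failed;		/* sector we could not make stable */

	return ret;
}

/* registered at init time with: blk_queue_issue_flush_fn(q, mydrv_issue_flush); */
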
On Saturday September 4, [email protected] wrote:
> On Sat, Sep 04 2004, Neil Brown wrote:
> > On Friday September 3, [email protected] wrote:
> > > md_flush_mddev just passes on the sector relative to the raid device,
> > > shouldn't it be translated somewhere?
> >
> > Yes. md_flush_mddev should simply be removed.
> > The functionality should be, and largely is, in the individual
> > personalities.
>
> Yes, sorry I was a little lazy there even though I followed the plugging
> conversion :(
>
> > Is there documentation somewhere on exactly what an issue_flush_fn
> > should do (is it allowed to sleep? what must happen before it is
> > allowed to return, what is the "error_sector" for, that sort of thing).
>
> It is allowed to sleep, you should return when the flush is complete.
> error_sector is the failed location, which really should be a dev,sector
> tuple.
Could I get a little more information about this function, please?
I've read through the code, and there isn't much in the way of
examples to follow: only reiserfs uses it, only scsi-disk and ide-disk
support it (I think).
It would seem that this is for write requests where b_end_io has already
been called, indicating that the data is safe, but that maybe the data
isn't really safe after all, and blk_issue_flush needs to be called.
I would have thought that after b_end_io is called, that data should
be safe anyway. Not so?
How do you tell a device: it is OK to just leave the data in cache,
I'll call blk_issue_flush when I want it safe.
Is this related to barriers at all?
NeilBrown
On Mon, Sep 06 2004, Neil Brown wrote:
> On Saturday September 4, [email protected] wrote:
> > On Sat, Sep 04 2004, Neil Brown wrote:
> > > On Friday September 3, [email protected] wrote:
> > > > md_flush_mddev just passes on the sector relative to the raid device,
> > > > shouldn't it be translated somewhere?
> > >
> > > Yes. md_flush_mddev should simply be removed.
> > > The functionality should be, and largely is, in the individual
> > > personalities.
> >
> > Yes, sorry I was a little lazy there even though I followed the plugging
> > conversion :(
> >
> > > Is there documentation somewhere on exactly what an issue_flush_fn
> > > should do (is it allowed to sleep? what must happen before it is
> > > allowed to return, what is the "error_sector" for, that sort of thing).
> >
> > It is allowed to sleep, you should return when the flush is complete.
> > error_sector is the failed location, which really should be a dev,sector
> > tuple.
>
> Could I get a little more information about this function, please?
> I've read through the code, and there isn't much in the way of
> examples to follow: only reiserfs uses it, only scsi-disk and ide-disk
> support it (I think).
That is correct. The current definition is to ensure that previously
sent writes are on disk. I hope to tie a range to it in the future, for
devices that can optimize the flush in that case. So for ide with write
back caching, it's currently a FLUSH_CACHE command. Ditto for SCSI. SCSI
with write through cache can make it a noop as well.
> It would seem that this is for write requests where b_end_io has already
> been called, indicating that the data is safe, but that maybe the data
> isn't really safe after all, and blk_issue_flush needs to be called.
Right on.
> I would have thought that after b_end_io is called, that data should
> be safe anyway. Not so?
Not necessarily, if you have write caching enabled.
> How do you tell a device: it is OK to just leave the data in cache,
> I'll call blk_issue_flush when I want it safe.
How would md know? The lower level driver knows what to do (if anything)
to ensure the data is safe.
> Is this related to barriers at all?
Yes and no. Currently it's used for fsync(), but can be used for
anything where you want to insert a flush point without having a piece
of data to tie it to.
--
Jens Axboe
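
Pushing this into the md personalities, as suggested earlier in the thread,
plausibly amounts to walking the member devices and forwarding the flush to
each underlying queue. The sketch below is illustrative only: the
conf/mirrors/faulty names stand in for whatever structures a given
personality actually keeps, and note that error_sector still comes back
member-relative, which is exactly the dev,sector tuple problem raised above.

/*
 * Hedged sketch of a per-personality flush: forward the flush to every
 * active member device.  Field names are illustrative, not copied from
 * any particular personality.
 */
static int myraid_issue_flush(request_queue_t *q, struct gendisk *disk,
			      sector_t *error_sector)
{
	mddev_t *mddev = q->queuedata;
	conf_t *conf = mddev_to_conf(mddev);
	int i, ret = 0;

	for (i = 0; i < mddev->raid_disks && ret == 0; i++) {
		mdk_rdev_t *rdev = conf->mirrors[i].rdev;

		if (rdev && !rdev->faulty) {
			struct block_device *bdev = rdev->bdev;
			request_queue_t *r_queue = bdev_get_queue(bdev);

			if (!r_queue->issue_flush_fn)
				ret = -EOPNOTSUPP;	/* member cannot flush */
			else
				/* error_sector is still member-relative here */
				ret = r_queue->issue_flush_fn(r_queue,
							      bdev->bd_disk,
							      error_sector);
		}
	}
	return ret;
}
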
On Mer, 2004-09-08 at 10:23, Jens Axboe wrote:
> That is correct. The current definition is to ensure that previously
> sent writes are on disk. I hope to tie a range to it in the future, for
> devices that can optimize the flush in that case. So for ide with write
> back caching, it's currently a FLUSH_CACHE command. Ditto for SCSI. SCSI
> with write through cache can make it a noop as well.
Some semantics questions I have, thinking about it from the I2O and
aacraid side: you talk about it as a barrier. Can other I/O cross the
cache flush? In other words, if I issue a flush_cache and continue doing
I/O, the flush will finish when the I/O outstanding at that time has
completed, but other I/O may get scheduled to disk first.
Secondly, what are the intended semantics for a flush error?
On Wed, Sep 08 2004, Alan Cox wrote:
> On Mer, 2004-09-08 at 10:23, Jens Axboe wrote:
> > That is correct. The current definition is to ensure that previously
> > sent writes are on disk. I hope to tie a range to it in the future, for
> > devices that can optimize the flush in that case. So for ide with write
> > back caching, it's currently a FLUSH_CACHE command. Ditto for SCSI. SCSI
> > with write through cache can make it a noop as well.
>
> Some semantics questions I have, thinking about it from the I2O and
> aacraid side: you talk about it as a barrier. Can other I/O cross the
> cache flush? In other words, if I issue a flush_cache and continue doing
> I/O, the flush will finish when the I/O outstanding at that time has
> completed, but other I/O may get scheduled to disk first.
That's a worry if it really does that - does it, or are you just
speculating about possible problems?
> Secondly, what are the intended semantics for a flush error?
It's up to the issuer. For IDE it would ideally be issuing FLUSH_CACHE
repeatedly until it doesn't error anymore, but keeping track of the
error location. Come to think of it, we should pass down the range right
now to flag which range we are actually interested in being errored on.
--
Jens Axboe
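
A hedged sketch of the retry strategy described above; issue_flush_cache(),
last_failed_sector() and MAX_FLUSH_RETRIES are hypothetical stand-ins for the
driver's real FLUSH CACHE plumbing:

#define MAX_FLUSH_RETRIES	5	/* hypothetical cap on retries */

/*
 * Retry FLUSH CACHE until it stops erroring, remembering the location of
 * the first failure so it can be reported back through *error_sector.
 */
static int flush_until_clean(struct mydrv_device *dev, sector_t *error_sector)
{
	sector_t first_bad = (sector_t)-1;
	int retries = 0;
	int ret;

	do {
		ret = issue_flush_cache(dev);	/* send FLUSH CACHE and wait */
		if (ret && first_bad == (sector_t)-1)
			first_bad = last_failed_sector(dev);	/* keep first failure */
	} while (ret && ++retries < MAX_FLUSH_RETRIES);

	if (error_sector && first_bad != (sector_t)-1)
		*error_sector = first_bad;

	return ret;
}
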
On Mer, 2004-09-08 at 16:46, Jens Axboe wrote:
> That's a worry if it really does that - does it, or are you just
> speculating about possible problems?
I2O defines cache flush very loosely. It flushes the cache and returns
when the cache has been flushed. From playing with the controllers I
have it seems some at least merge further queued writes into the output
stream. Thus if I issue
write 1, 2, 3, 4, 40, 41, flush cache, write 5, 6, 100
it'll write 1, 2, 3, 4, 5, 6, 40, 41, then report flush cache complete.
Obviously I can implement full barrier semantics in the driver if need
be but that would cost performance hence the question.
On Wed, Sep 08 2004, Alan Cox wrote:
> On Mer, 2004-09-08 at 16:46, Jens Axboe wrote:
> > That's a worry if it really does that - does it, or are you just
> > speculating about possible problems?
>
> I2O defines cache flush very loosely. It flushes the cache and returns
> when the cache has been flushed. From playing with the controllers I
> have it seems some at least merge further queued writes into the output
> stream. Thus if I issue
>
> write 1, 2, 3, 4, 40, 41, flush cache, write 5, 6, 100
>
> it'll write 1, 2, 3, 4, 5, 6, 40, 41, then report flush cache complete.
>
> Obviously I can implement full barrier semantics in the driver if need
> be but that would cost performance hence the question.
Precisely, it's always possible to just drop queueing depth to zero at
that point. If I2O really does reorder around the cache flush (this
seems broken...), then you probably should.
--
Jens Axboe
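
A hedged sketch of the "drop the queueing depth to zero" fallback for hardware
that can reorder around a cache flush; the stop/start helpers, the
outstanding_cmds counter and the drain_wait waitqueue are all made up for
illustration:

/*
 * Emulate barrier semantics by draining the queue before the flush, so no
 * later write can be merged in ahead of it.
 */
static int mydrv_barrier_flush(struct mydrv_device *dev, sector_t *error_sector)
{
	int ret;

	mydrv_stop_queue(dev);		/* 1. stop feeding new commands */

	/* 2. wait until everything already sent to the hardware completes */
	wait_event(dev->drain_wait, atomic_read(&dev->outstanding_cmds) == 0);

	/* 3. now issue the cache flush itself */
	ret = mydrv_flush_cache(dev, error_sector);

	mydrv_start_queue(dev);		/* 4. resume normal queueing */

	return ret;
}
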
> Precisely, it's always possible to just drop queueing depth to zero at
> that point. If I2O really does reorder around the cache flush (this
> seems broken...),
Why does this seem broken? The semantics of "cache flush guarantees that all
I/O submitted prior to it hits the spindle" are quite sane IMO; there's no
guarantee about later submitted I/O. Compare the unix "sync" command; same
level of semantics.
On Thu, Sep 09 2004, Arjan van de Ven wrote:
>
> > Precisely, it's always possible to just drop queueing depth to zero at
> > that point. If I2O really does reorder around the cache flush (this
> > seems broken...),
>
> Why does this seem broken? The semantics of "cache flush guarantees that all
> I/O submitted prior to it hits the spindle" are quite sane IMO; there's no
> guarantee about later submitted I/O. Compare the unix "sync" command; same
> level of semantics.
Depends on your angle; I think it breaks the principle of least
surprise.
--
Jens Axboe
On Iau, 2004-09-09 at 09:29, Jens Axboe wrote:
> > Why does this seem broken? The semantics of "cache flush guarantees that all
> > I/O submitted prior to it hits the spindle" are quite sane IMO; there's no
> > guarantee about later submitted I/O. Compare the unix "sync" command; same
> > level of semantics.
>
> > Depends on your angle; I think it breaks the principle of least
> > surprise.
As far as I can ascertain, raid controllers in general follow this set of
semantics. It's less of an issue for many of them with battery backup,
obviously.
It also makes a lot of sense at the hardware level for performance,
especially when dealing with raid.
Alan
On Thu, Sep 09 2004, Alan Cox wrote:
> On Iau, 2004-09-09 at 09:29, Jens Axboe wrote:
> > > Why does this seem broken? The semantics of "cache flush guarantees that all
> > > I/O submitted prior to it hits the spindle" are quite sane IMO; there's no
> > > guarantee about later submitted I/O. Compare the unix "sync" command; same
> > > level of semantics.
> >
> > Depends on your angle; I think it breaks the principle of least
> > surprise.
>
> As far as I can ascertain, raid controllers in general follow this set of
> semantics. It's less of an issue for many of them with battery backup,
> obviously.
>
> It also makes a lot of sense at the hardware level for performance,
> especially when dealing with raid.
Yes. As long as the required semantics aren't explicitly guaranteed in
the specification, we should not rely on them.
--
Jens Axboe
On Wed, Sep 08, 2004 at 11:21:39PM +0100, Alan Cox wrote:
> I2O defines cache flush very loosely. It flushes the cache and returns
[...]
> write 1, 2, 3, 4, 40, 41, flush cache, write 5, 6, 100
> it'll write 1, 2, 3, 4, 5, 6, 40, 41, then report flush cache complete.
which, if 5 and 6 are the metadata updates belonging to logfile writes
40 and 41, and the system powers down between 5 and 41, spells trouble.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****