2019-01-10 14:30:03

by Kurt Miller

Subject: Block device flush ordering

For a well behaved block device that has a writeback cache,
what is the proper behavior of flush when there is more
than one outstanding flush operation? Is it:

Flush all writes seen since the last flush.
or
Flush all writes received prior to the flush including
those before any prior flush.

For example take the following order of requests presented
to the block device:

writes 1-5
flush 1
write 6
flush 2

Can flush 2 finish with success as soon as write 6 is flushed
(which may be before flush 1 success)? Or must it wait for
all prior write operations to flush (writes 1-6)?

This question has come up in our implementation of an NBD
user-space block device, and we have not found a definitive answer
on which behavior is correct for us to conform to. We want to
ensure we behave as required for file-system commit write
ordering.

Best,
-Kurt


2019-01-14 16:45:51

by Christoph Hellwig

Subject: Re: Block device flush ordering

On Mon, Jan 14, 2019 at 09:42:44AM +1100, Dave Chinner wrote:
> On Thu, Jan 10, 2019 at 09:30:01AM -0500, Kurt Miller wrote:
> > For a well behaved block device that has a writeback cache,
> > what is the proper behavior of flush when there are more
> > than one outstanding flush operations? Is it:
> >
> > Flush all writes seen since the last flush.
> > or
> > Flush all writes received prior to the flush including
> > those before any prior flush.

The requirement is that all write operations that have been completed
before the flush was seen are on stable storage. How that is
implemented in detail is up to the device. The typical implementation
is simply to write back the whole cache every time a flush operation
is received.
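
A minimal sketch of that typical implementation, as a toy in-memory model
rather than real device code. The class name WritebackCacheDevice and its
methods are hypothetical, and returning from write() is assumed to stand in
for signalling completion of the write:

```python
class WritebackCacheDevice:
    """Toy model of a block device with a volatile writeback cache."""

    def __init__(self):
        self.cache = {}   # block number -> data acknowledged but still volatile
        self.stable = {}  # block number -> data on stable storage

    def write(self, block, data):
        # Assumption: returning from write() is the completion signal.
        self.cache[block] = data

    def flush(self):
        # Write back the whole cache, not only writes since the last flush.
        # This trivially covers every write completed before the flush.
        self.stable.update(self.cache)
        self.cache.clear()


dev = WritebackCacheDevice()
for block in range(1, 6):
    dev.write(block, f"data{block}")
dev.flush()                                  # flush 1
stable_after_flush1 = sorted(dev.stable)
dev.write(6, "data6")
dev.flush()                                  # flush 2
stable_after_flush2 = sorted(dev.stable)
print(stable_after_flush1, stable_after_flush2)
```

Because each flush empties the whole cache, the model never has to track
which flush a given write "belongs" to.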

> >
> > For example take the following order of requests presented
> > to the block device:
> >
> > writes 1-5
> > flush 1
> > write 6
> > flush 2
> >
> > Can flush 2 finish with success as soon as write 6 is flushed
> > (which may be before flush 1 success)? Or must it wait for
> > all prior write operations to flush (writes 1-6)?

No. For all the usual protocols as well as the linux kernel semantics
there is no overall command ordering, especially as there is no way
to even enforce that in a multi-queue environment.

>
> * C1. At any given time, only one flush shall be in progress. This makes
> * double buffering sufficient.

Very specific implementation detail inside the request layer.

> Then flush 1 does not guarantee any of the writes are on stable
> storage. They *may* be on stable storage if the timing is right, but
> it is not guaranteed by the OS code. Likewise, flush 2 only
> guarantees writes 1, 3 and 5 are on stable storage because they are
> the only writes that have been signalled as complete when flush 2
> was submitted.

Exactly.

2019-01-15 14:35:44

by Kurt Miller

Subject: Re: Block device flush ordering

On Mon, 2019-01-14 at 08:45 -0800, Christoph Hellwig wrote:
> On Mon, Jan 14, 2019 at 09:42:44AM +1100, Dave Chinner wrote:
> >
> > On Thu, Jan 10, 2019 at 09:30:01AM -0500, Kurt Miller wrote:
> > >
> > > For a well behaved block device that has a writeback cache,
> > > what is the proper behavior of flush when there are more
> > > than one outstanding flush operations? Is it:
> > >
> > > Flush all writes seen since the last flush.
> > > or
> > > Flush all writes received prior to the flush including
> > > those before any prior flush.
> The requirement is that all write operations that have been completed
> before the flush was seen are on stable storage. How that is
> implemented in detail is up to the device. The typical implementation
> is simply to write back the whole cache every time a flush operation
> is received.
>
> >
> > >
> > >
> > > For example take the following order of requests presented
> > > to the block device:
> > >
> > > writes 1-5
> > > flush 1
> > > write 6
> > > flush 2
> > >
> > > Can flush 2 finish with success as soon as write 6 is flushed
> > > (which may be before flush 1 success)? Or must it wait for
> > > all prior write operations to flush (writes 1-6)?
> No. For all the usual protocols as well as the linux kernel semantics
> there is no overall command ordering, especially as there is no way
> to even enforce that in a multi-queue environment.
>
> >
> >
> >  * C1. At any given time, only one flush shall be in progress. This makes
> >  *     double buffering sufficient.
> Very specific implementation detail inside the request layer.
>
> >
> > Then flush 1 does not guarantee any of the writes are on stable
> > storage. They *may* be on stable storage if the timing is right, but
> > it is not guaranteed by the OS code. Likewise, flush 2 only
> > guarantees writes 1, 3 and 5 are on stable storage because they are
> > the only writes that have been signalled as complete when flush 2
> > was submitted.
> Exactly.

Thank you both for the detailed answers. They have been very helpful.
Also, after spending an afternoon reading kernel code (xlog_sync through
blk_flush_complete_seq) I understand it better. The multiple concurrent
flush requests comment I made in another reply was a logging issue in
our nbd implementation where we were logging completions after replying
to the kernel. As a result our log messages were out of order and
misleading. With that corrected in our code we see only one flush at a
time.
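
For anyone hitting the same symptom, here is a small simulation of how log
placement alone can make serialized flushes look concurrent. It is only a
sketch: the function names are hypothetical, and the "kernel" is modelled as
something that issues the next flush the moment it receives a reply, which is
an assumption made for illustration.

```python
def run(log_before_reply):
    """Return the log produced by two strictly serialized flushes."""
    log = []
    pending = [1, 2]  # flush ids the simulated kernel issues one at a time

    def handle_flush(flush_id):
        log.append(f"flush {flush_id} received")
        if log_before_reply:
            log.append(f"flush {flush_id} completed")
        reply(flush_id)
        if not log_before_reply:
            # Logging after the reply: the next request may already be logged.
            log.append(f"flush {flush_id} completed")

    def reply(flush_id):
        # Simulated kernel: the next flush is sent as soon as the reply arrives.
        if pending:
            handle_flush(pending.pop(0))

    handle_flush(pending.pop(0))
    return log


misleading = run(log_before_reply=False)
ordered = run(log_before_reply=True)
print(misleading)
print(ordered)
```

With logging after the reply, the log shows flush 2 arriving before flush 1
is reported complete, even though the simulated kernel never had two flushes
outstanding.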

Best,
-Kurt

2019-01-11 09:24:31

by Stefan Ring

Subject: Re: Block device flush ordering

On Thu, Jan 10, 2019 at 3:31 PM Kurt Miller <[email protected]> wrote:
>
> For a well behaved block device that has a writeback cache,
> what is the proper behavior of flush when there are more
> than one outstanding flush operations? Is it:
>
> Flush all writes seen since the last flush.
> or
> Flush all writes received prior to the flush including
> those before any prior flush.
>
> For example take the following order of requests presented
> to the block device:
>
> writes 1-5
> flush 1
> write 6
> flush 2
>
> Can flush 2 finish with success as soon as write 6 is flushed
> (which may be before flush 1 success)? Or must it wait for
> all prior write operations to flush (writes 1-6)?
>
> This question has come up in our implementation of an NBD
> user-space block device, and we have not found a definitive answer
> on which behavior is correct for us to conform to. We want to
> ensure we behave as required for file-system commit write
> ordering.

As an interested outside observer who has had a bit of exposure to
memory models I would pose the question differently: Should flushes be
allowed to execute concurrently or should there be a total order? If a
total order is imposed, the premise of the question does not exist,
and otherwise I cannot see a single good reason to "wait for all prior
write operations to flush" because the second thread (the one
executing write 6 and flush 2) cannot even determine in a non-esoteric
way if another flush is ongoing or not.

2019-01-13 22:42:48

by Dave Chinner

Subject: Re: Block device flush ordering

[ cc'd [email protected], where questions about block
device behaviour are better directed. ]

On Thu, Jan 10, 2019 at 09:30:01AM -0500, Kurt Miller wrote:
> For a well behaved block device that has a writeback cache,
> what is the proper behavior of flush when there are more
> than one outstanding flush operations? Is it:
>
> Flush all writes seen since the last flush.
> or
> Flush all writes received prior to the flush including
> those before any prior flush.
>
> For example take the following order of requests presented
> to the block device:
>
> writes 1-5
> flush 1
> write 6
> flush 2
>
> Can flush 2 finish with success as soon as write 6 is flushed
> (which may be before flush 1 success)? Or must it wait for
> all prior write operations to flush (writes 1-6)?

Don't take what I say as gospel, but according to block/blk-flush.c:

.....
* Currently, the following conditions are used to determine when to issue
* flush.
*
* C1. At any given time, only one flush shall be in progress. This makes
* double buffering sufficient.
.....

However, flushes can be deferred and re-ordered vs other non-flush
write IO dispatch. As such, the rule we work to with filesystems is
that a flush only guarantees that IO which has already completed will be
written to stable storage, i.e. the filesystem has to wait for IO
completion of a write IO it needs to be stable before it can issue
(and wait for) a flush that will guarantee that it is on stable
storage.

IOWs, if your above scenario is:

submit writes 1-5
flush 1
submit write 6
writes 1,3,5 complete
flush 2
writes 2,4,6 complete

Then flush 1 does not guarantee any of the writes are on stable
storage. They *may* be on stable storage if the timing is right, but
it is not guaranteed by the OS code. Likewise, flush 2 only
guarantees writes 1, 3 and 5 are on stable storage because they are
the only writes that have been signalled as complete when flush 2
was submitted.
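
The same timeline can be replayed as a small sketch. The event names and the
function guaranteed_by_flush are hypothetical and only model the rule
described above (a flush covers writes whose completion was signalled before
the flush was submitted); this is not kernel code.

```python
def guaranteed_by_flush(events):
    """Map each flush id to the writes it is guaranteed to have made stable.

    events is a list of (kind, ident) tuples, where kind is one of
    "submit", "complete" or "flush".
    """
    completed = set()
    guaranteed = {}
    for kind, ident in events:
        if kind == "complete":
            completed.add(ident)
        elif kind == "flush":
            # Only writes already signalled complete are covered.
            guaranteed[ident] = sorted(completed)
        # A "submit" on its own guarantees nothing.
    return guaranteed


timeline = (
    [("submit", n) for n in range(1, 6)]      # submit writes 1-5
    + [("flush", 1), ("submit", 6)]           # flush 1, submit write 6
    + [("complete", n) for n in (1, 3, 5)]    # writes 1,3,5 complete
    + [("flush", 2)]                          # flush 2
    + [("complete", n) for n in (2, 4, 6)]    # writes 2,4,6 complete
)
result = guaranteed_by_flush(timeline)
print(result)  # {1: [], 2: [1, 3, 5]}
```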

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-12 00:30:50

by Kurt Miller

Subject: Re: Block device flush ordering

On Fri, 2019-01-11 at 10:24 +0100, Stefan Ring wrote:
> On Thu, Jan 10, 2019 at 3:31 PM Kurt Miller <[email protected]> wrote:
> >
> >
> > For a well behaved block device that has a writeback cache,
> > what is the proper behavior of flush when there are more
> > than one outstanding flush operations? Is it:
> >
> > Flush all writes seen since the last flush.
> > or
> > Flush all writes received prior to the flush including
> > those before any prior flush.
> >
> > For example take the following order of requests presented
> > to the block device:
> >
> >         writes 1-5
> >         flush 1
> >         write 6
> >         flush 2
> >
> > Can flush 2 finish with success as soon as write 6 is flushed
> > (which may be before flush 1 success)? Or must it wait for
> > all prior write operations to flush (writes 1-6)?
> >
> > This question has come up in our implementation of an NBD
> > user-space block device, and we have not found a definitive answer
> > on which behavior is correct for us to conform to. We want to
> > ensure we behave as required for file-system commit write
> > ordering.
> As an interested outside observer who has had a bit of exposure to
> memory models I would pose the question differently: Should flushes be
> allowed to execute concurrently or should there be a total order? If a
> total order is imposed, the premise of the question does not exist,
> and otherwise I cannot see a single good reason to "wait for all prior
> write operations to flush" because the second thread (the one
> executing write 6 and flush 2) cannot even determine in a non-esoteric
> way if another flush is ongoing or not.

Hi Stefan,

Thank you for your comments. Our nbd block device implementation is
asynchronous in nature. We are able to conform to either behavior.

I can confirm that the kernel does in fact send multiple concurrent
REQ_OP_FLUSH requests to block devices. So I'm trying to determine what
behavior is acceptable when this occurs. Should we impose total ordering
of the flush operations or allow flush operations to complete out of
order when they finish first?

Best,
-Kurt