2016-10-02 16:17:19

by Alex Bligh

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support


> On 29 Sep 2016, at 17:59, Josef Bacik <[email protected]> wrote:
>
> On 09/29/2016 12:41 PM, Wouter Verhelst wrote:
>> On Thu, Sep 29, 2016 at 10:03:50AM -0400, Josef Bacik wrote:
>>> So think of it like normal disks with multiple channels. We don't send flushes
>>> down all the hwq's to make sure they are clear, we leave that decision up to the
>>> application (usually a FS of course).
>>
>> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
>> when a FLUSH is sent over one channel, and the reply comes in, that all
>> commands which have been received, regardless of which channel they were
>> received over, have reached disk.
>>
>> [1] Message-ID: <[email protected]>
>>
>> It is impossible for nbd to make such a guarantee, due to head-of-line
>> blocking on TCP.
>>
>
> Huh I missed that. Yeah that's not possible for us for sure, I think my option
> idea is the less awful way forward if we want to address that limitation. Thanks,

I think if the server supports flush (which you can tell), sending flush on
all channels is the only safe thing to do, without substantial protocol
changes (which I'm not sure how one would do given flush is in a sense
a synchronisation point). I think it's thus imperative this gets fixed
before the change gets merged.

--
Alex Bligh






2016-10-03 01:47:45

by Josef Bacik

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support


> On Oct 2, 2016, at 12:17 PM, Alex Bligh <[email protected]> wrote:
>
>
>> On 29 Sep 2016, at 17:59, Josef Bacik <[email protected]> wrote:
>>
>> On 09/29/2016 12:41 PM, Wouter Verhelst wrote:
>>>> On Thu, Sep 29, 2016 at 10:03:50AM -0400, Josef Bacik wrote:
>>>> So think of it like normal disks with multiple channels. We don't send flushes
>>>> down all the hwq's to make sure they are clear, we leave that decision up to the
>>>> application (usually a FS of course).
>>>
>>> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
>>> when a FLUSH is sent over one channel, and the reply comes in, that all
>>> commands which have been received, regardless of which channel they were
>>> received over, have reached disk.
>>>
>>> [1] Message-ID: <[email protected]>
>>>
>>> It is impossible for nbd to make such a guarantee, due to head-of-line
>>> blocking on TCP.
>>>
>>
>> Huh I missed that. Yeah that's not possible for us for sure, I think my option
>> idea is the less awful way forward if we want to address that limitation. Thanks,
>
> I think if the server supports flush (which you can tell), sending flush on
> all channels is the only safe thing to do, without substantial protocol
> changes (which I'm not sure how one would do given flush is in a sense
> a synchronisation point). I think it's thus imperative this gets fixed
> before the change gets merged.

It's not "broken", it's working as designed, and any fs on top of this patch will be perfectly safe because they all wait for their io to complete before issuing the FLUSH. If somebody wants to address the paranoid case later then all the power to them, but this works for my use case and isn't inherently broken. If it doesn't work for yours then don't use the feature, it's that simple. Thanks,

Josef

2016-10-03 07:21:03

by Christoph Hellwig

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Mon, Oct 03, 2016 at 01:47:06AM +0000, Josef Bacik wrote:
> It's not "broken", it's working as designed, and any fs on top of this patch will be perfectly safe because they all wait for their io to complete before issuing the FLUSH. If somebody wants to address the paranoid case later then all the power to them, but this works for my use case and isn't inherently broken. If it doesn't work for yours then don't use the feature, it's that simple. Thanks,

Let's take one step back here. I agree with Josef that sending
one single flush is perfectly fine for all usual cases. The issue
that was brought up last time we had this discussion was that some
(I think mostly theoretical) backends could not be coherent and
this would be an issue.

So maybe the right way is to simply not support the current odd flush
definition in the kernel then and require a new properly defined flush
version instead.

2016-10-03 07:49:24

by Wouter Verhelst

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Sun, Oct 02, 2016 at 05:17:14PM +0100, Alex Bligh wrote:
> On 29 Sep 2016, at 17:59, Josef Bacik <[email protected]> wrote:
> > Huh I missed that. Yeah that's not possible for us for sure, I think my option
> > idea is the less awful way forward if we want to address that limitation. Thanks,
>
> I think if the server supports flush (which you can tell), sending flush on
> all channels is the only safe thing to do, without substantial protocol
> changes (which I'm not sure how one would do given flush is in a sense
> a synchronisation point). I think it's thus imperative this gets fixed
> before the change gets merged.

Whoa there, Alex.

I don't think this should be a blocker. There is a theoretical problem
yes, but I believe it to be limited to the case where the client and the
server are not in the same broadcast domain, which is not the common
case (most NBD connections run either over the localhost iface, or to a
machine nearby). In the case where the client and server are on the same
LAN, random packet drop is highly unlikely, so TCP communication will
not be delayed and so the replies will, with high certainty, arrive in
the same order that they were sent.

Obviously the documentation for the "spawn multiple connections" option
in nbd-client needs to clearly state that it will decrease reliability
in this edge case, but I don't think that blocking this feature until a
solution for this problem is implemented is the right way forward. There
are valid use cases where using multiple connections is preferable, even
with the current state of affairs, and they do not all involve "switch
off flush".

Regards,

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
people in the world who think they really understand all of its rules,
and pretty much all of them are just lying to themselves too.
-- #debian-devel, OFTC, 2016-02-12

2016-10-03 07:52:23

by Wouter Verhelst

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Mon, Oct 03, 2016 at 12:20:49AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 03, 2016 at 01:47:06AM +0000, Josef Bacik wrote:
> > It's not "broken", it's working as designed, and any fs on top of this
> > patch will be perfectly safe because they all wait for their io to complete
> > before issuing the FLUSH. If somebody wants to address the paranoid case
> > later then all the power to them, but this works for my use case and isn't
> > inherently broken. If it doesn't work for yours then don't use the
> > feature, it's that simple. Thanks,
>
> Let's take one step back here. I agree with Josef that sending
> one single flush is perfectly fine for all usual cases. The issue
> that was brought up last time we had this discussion was that some
> (I think mostly theoretical) backends could not be coherent and
> this would be an issue.

Actually, I was pointing out the TCP head-of-line issue, where a delay
on the socket that contains the flush reply would result in the arrival
in the kernel block layer of a write reply before the said flush reply,
resulting in a write being considered part of the flush when in fact it
was not.

This is an edge case, and one highly unlikely to result in problems in
the common case (as per my other mail), but it is something to consider.

> So maybe the right way is to simply not support the current odd flush
> definition in the kernel then and require a new properly defined flush
> version instead.

Can you clarify what you mean by that? Why is it an "odd flush
definition", and how would you "properly" define it?

Thanks,

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
people in the world who think they really understand all of its rules,
and pretty much all of them are just lying to themselves too.
-- #debian-devel, OFTC, 2016-02-12

2016-10-03 07:57:13

by Christoph Hellwig

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Mon, Oct 03, 2016 at 09:51:49AM +0200, Wouter Verhelst wrote:
> Actually, I was pointing out the TCP head-of-line issue, where a delay
> on the socket that contains the flush reply would result in the arrival
> in the kernel block layer of a write reply before the said flush reply,
> resulting in a write being considered part of the flush when in fact it
> was not.

The kernel (or any other user of SCSI/ATA/NVMe-like cache flushes)
will explicitly wait for the completion of all I/O that it needs the
cache flush to cover, so this is not a problem.

> Can you clarify what you mean by that? Why is it an "odd flush
> definition", and how would you "properly" define it?

E.g. take the definition from NVMe which also supports multiple queues:

"The Flush command shall commit data and metadata associated with the
specified namespace(s) to non-volatile media. The flush applies to all
commands completed prior to the submission of the Flush command.
The controller may also flush additional data and/or metadata from any
namespace."

The focus is completed - we need to get a reply to the host first
before we can send the flush command, so anything that we require
to be flushed needs to explicitly be completed first.
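
As a rough client-side sketch of that rule (the nbd_* helpers are
hypothetical stand-ins, not the real kernel or nbd-client API): the
flush is only transmitted once every write that needs to be covered
has already completed.

/* Minimal sketch: only send NBD_CMD_FLUSH once the writes we care
 * about have completed, i.e. their replies have been received. */
#include <stdbool.h>
#include <stdio.h>

struct write_req {
	int  handle;
	bool completed;		/* reply received from the server */
};

/* Hypothetical transport hooks; a real client does socket I/O here. */
static void nbd_wait_for_reply(struct write_req *w) { w->completed = true; }
static void nbd_send_flush(void) { puts("-> NBD_CMD_FLUSH"); }

/* Wait for every outstanding write, then issue the flush; the flush
 * then covers reqs[0..n-1], because they completed before it was
 * submitted. */
static void flush_after(struct write_req *reqs, int n)
{
	for (int i = 0; i < n; i++)
		if (!reqs[i].completed)
			nbd_wait_for_reply(&reqs[i]);
	nbd_send_flush();
}

int main(void)
{
	struct write_req reqs[3] = { {1, false}, {2, false}, {3, false} };
	flush_after(reqs, 3);
	return 0;
}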

2016-10-03 11:34:41

by Alex Bligh

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support


> On 3 Oct 2016, at 08:57, Christoph Hellwig <[email protected]> wrote:
>
>> Can you clarify what you mean by that? Why is it an "odd flush
>> definition", and how would you "properly" define it?
>
> E.g. take the definition from NVMe which also supports multiple queues:
>
> "The Flush command shall commit data and metadata associated with the
> specified namespace(s) to non-volatile media. The flush applies to all
> commands completed prior to the submission of the Flush command.
> The controller may also flush additional data and/or metadata from any
> namespace."
>
> The focus is completed - we need to get a reply to the host first
> before we can send the flush command, so anything that we require
> to be flushed needs to explicitly be completed first.

I think there are two separate issues here:

a) What's described as the "HOL blocking issue".

This comes down to what Wouter said here:

> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
> when a FLUSH is sent over one channel, and the reply comes in, that all
> commands which have been received, regardless of which channel they were
> received over, have reached disk.
>
> [1] Message-ID: <[email protected]>
>
> It is impossible for nbd to make such a guarantee, due to head-of-line
> blocking on TCP.

this is perfectly accurate as far as it goes, but this isn't the current
NBD definition of 'flush'.

That is (from the docs):

> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH. This paragraph only applies if NBD_FLAG_SEND_FLUSH is set within the transmission flags, as otherwise NBD_CMD_FLUSH will never be sent by the client to the server.

I don't think HOL blocking is an issue here by that definition, because all FLUSH requires is that commands that are actually completed are flushed to disk. If there is head of line blocking which delays the arrival of a write issued before a flush, then the sender cannot be relying on whether that write is actually completed or not (or it would have waited for the result). The flush requires only that those commands COMPLETED are flushed to disk, not that those commands RECEIVED have been flushed to disk (and a fortiori not that those commands SENT FIRST) have been flushed to disk. From the point of view of the client, the flush can therefore only guarantee that the data associated with those commands for which it's actually received a reply prior to issuing the flush will be flushed, because the replies can be disordered too.

I don't think there is actually a problem here - Wouter if I'm wrong about this, I'd like to understand your argument better.



b) What I'm describing - which is the lack of synchronisation between channels.

Suppose you have a simple forking NBD server which uses (e.g.) a Ceph backend. Each process (i.e. each NBD channel) will have a separate connection to something with its own cache and buffering. Issuing a flush in Ceph requires waiting until a quorum of backends (OSDs) has been written to, and with a number of backends substantially greater than the quorum, it is not unlikely that a flush on one channel will not wait for writes on what Ceph considers a completely independent channel to have fully written (assuming the write completes before the flush is done).

The same would happen pretty trivially with a forking server that uses a process-space write-back cache.

This is because when the spec says: "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH.", what it currently means is actually "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH THAT CLIENT*** that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH".

So what we would need the spec to mean is "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CHANNEL OF THAT CLIENT*** that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH". And as we have no way to associate different channels of the same client, for servers that can't rely on the OS to synchronise flushing across different clients relating to the same file, in practice that means "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CLIENT AT ALL*** that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH" - i.e. a flush on any channel of any client must flush every channel of every client, because we have no easy way to tell which clients are in fact two channels. I have concerns over the scalability of this.

Now, in the reference server, NBD_CMD_FLUSH is implemented through an fdatasync(). Each client (and therefore each channel) runs in a different process.

Earlier in this thread, someone suggested that if this happens:

Process A                    Process B
=========                    =========
fd1=open("file123")
                             fd2=open("file123")
write(fd1, ...)
                             fdatasync(fd2)

then the fdatasync() is guaranteed to sync the write that Process A has written. This may or may not be the case under Linux (wiser minds than me will know). Is it guaranteed to be the case with (e.g.) the file on NFS? On all POSIX platforms?

Looking at http://pubs.opengroup.org/onlinepubs/009695399/functions/fdatasync.html I'd say it was a little ambiguous as to whether it will ALWAYS flush all data associated with the file even if it is being written by a different process (and a different FD).

If fdatasync() is always guaranteed to flush data associated with writes by a different process (and separately opened fd), then as it happens there won't be a problem on the reference server, just on servers that don't happen to use fdatasync() or similar to implement flushes, and which don't maintain their own caches. If fdatasync() is not so guaranteed, we have a problem with the reference server too, at least on some platforms and filing systems.
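
For concreteness, the scenario in the table above can be written as the
following self-contained illustration (hypothetical file name, error
handling omitted); the open question is exactly whether the fdatasync()
on fd2 is guaranteed, on every POSIX platform and filing system, to
persist the write made through the separately opened fd1 in the other
process.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	const char msg[] = "data written via fd1\n";

	/* "Process B" opens its own descriptor first. */
	int fd2 = open("file123", O_WRONLY | O_CREAT, 0644);

	if (fork() == 0) {
		/* "Process A": separate process, separate descriptor. */
		int fd1 = open("file123", O_WRONLY);
		write(fd1, msg, strlen(msg));
		close(fd1);
		_exit(0);
	}

	wait(NULL);		/* A's write() has returned by now */

	/* Does this persist the data written through fd1? */
	if (fdatasync(fd2) == 0)
		puts("fdatasync(fd2) returned success");
	close(fd2);
	return 0;
}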

What I'm therefore asking for is either:
a) that the server can say 'if you are multichannel, you will need to send flush on each channel' (best); OR
b) that the server can say 'don't go multichannel'

as part of the negotiation stage. Of course as this is dependent on the backend, this is going to be something that is per-target (i.e. needs to come as a transmission flag or similar).

--
Alex Bligh





2016-10-03 14:33:26

by Josef Bacik

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On 10/03/2016 07:34 AM, Alex Bligh wrote:
>
>> On 3 Oct 2016, at 08:57, Christoph Hellwig <[email protected]> wrote:
>>
>>> Can you clarify what you mean by that? Why is it an "odd flush
>>> definition", and how would you "properly" define it?
>>
>> E.g. take the definition from NVMe which also supports multiple queues:
>>
>> "The Flush command shall commit data and metadata associated with the
>> specified namespace(s) to non-volatile media. The flush applies to all
>> commands completed prior to the submission of the Flush command.
>> The controller may also flush additional data and/or metadata from any
>> namespace."
>>
>> The focus is completed - we need to get a reply to the host first
>> before we can send the flush command, so anything that we require
>> to be flushed needs to explicitly be completed first.
>
> I think there are two separate issues here:
>
> a) What's described as the "HOL blocking issue".
>
> This comes down to what Wouter said here:
>
>> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
>> when a FLUSH is sent over one channel, and the reply comes in, that all
>> commands which have been received, regardless of which channel they were
>> received over, have reached disk.
>>
>> [1] Message-ID: <[email protected]>
>>
>> It is impossible for nbd to make such a guarantee, due to head-of-line
>> blocking on TCP.
>
> this is perfectly accurate as far as it goes, but this isn't the current
> NBD definition of 'flush'.
>
> That is (from the docs):
>
>> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH. This paragraph only applies if NBD_FLAG_SEND_FLUSH is set within the transmission flags, as otherwise NBD_CMD_FLUSH will never be sent by the client to the server.
>
> I don't think HOL blocking is an issue here by that definition, because all FLUSH requires is that commands that are actually completed are flushed to disk. If there is head of line blocking which delays the arrival of a write issued before a flush, then the sender cannot be relying on whether that write is actually completed or not (or it would have waited for the result). The flush requires only that those commands COMPLETED are flushed to disk, not that those commands RECEIVED have been flushed to disk (and a fortiori not that those commands SENT FIRST) have been flushed to disk. From the point of view of the client, the flush can therefore only guarantee that the data associated with those commands for which it's actually received a reply prior to issuing the flush will be flushed, because the replies can be disordered too.
>
> I don't think there is actually a problem here - Wouter if I'm wrong about this, I'd like to understand your argument better.
>
>
>
> b) What I'm describing - which is the lack of synchronisation between channels.
>
> Suppose you have a simple forking NBD server which uses (e.g.) a Ceph backend. Each process (i.e. each NBD channel) will have a separate connection to something with its own cache and buffering. Issuing a flush in Ceph requires waiting until a quorum of backends (OSDs) has been written to, and with a number of backends substantially greater than the quorum, it is not unlikely that a flush on one channel will not wait for writes on what Ceph considers a completely independent channel to have fully written (assuming the write completes before the flush is done).
>
> The same would happen pretty trivially with a forking server that uses a process-space write-back cache.
>
> This is because when the spec says: "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH.", what it currently means is actually "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH THAT CLIENT*** that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH".
>
> So what we would need the spec to mean is "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CHANNEL OF THAT CLIENT*** that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH". And as we have no way to associate different channels of the same client, for servers that can't rely on the OS to synchronise flushing across different clients relating to the same file, in practice that means "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CLIENT AT ALL*** that the server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH" - i.e. a flush on any channel of any client must flush every channel of every client, because we have no easy way to tell which clients are in fact two channels. I have concerns over the scalability of this.
>
> Now, in the reference server, NBD_CMD_FLUSH is implemented through an fdatasync(). Each client (and therefore each channel) runs in a different process.
>
> Earlier in this thread, someone suggested that if this happens:
>
> Process A                    Process B
> =========                    =========
> fd1=open("file123")
>                              fd2=open("file123")
> write(fd1, ...)
>                              fdatasync(fd2)
>
> then the fdatasync() is guaranteed to sync the write that Process A has written. This may or may not be the case under Linux (wiser minds than me will know). Is it guaranteed to be the case with (e.g.) the file on NFS? On all POSIX platforms?
>
> Looking at http://pubs.opengroup.org/onlinepubs/009695399/functions/fdatasync.html I'd say it was a little ambiguous as to whether it will ALWAYS flush all data associated with the file even if it is being written by a different process (and a different FD).
>
> If fdatasync() is always guaranteed to flush data associated with writes by a different process (and separately opened fd), then as it happens there won't be a problem on the reference server, just on servers that don't happen to use fdatasync() or similar to implement flushes, and which don't maintain their own caches. If fdatasync() is not so guaranteed, we have a problem with the reference server too, at least on some platforms and filing systems.
>
> What I'm therefore asking for is either:
> a) that the server can say 'if you are multichannel, you will need to send flush on each channel' (best); OR
> b) that the server can say 'don't go multichannel'
>
> as part of the negotiation stage. Of course as this is dependent on the backend, this is going to be something that is per-target (i.e. needs to come as a transmission flag or similar).
>

Ok I understand your objections now. You aren't arguing that we are unsafe by
default, only that we are unsafe with servers that do something special beyond
simply writing to a single disk or file. I agree this is problematic, but you
simply don't use this feature if your server can't deal with it well. Thanks,

Josef

2016-10-03 14:47:06

by Alex Bligh

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

Josef,

> On 3 Oct 2016, at 15:32, Josef Bacik <[email protected]> wrote:
>
> Ok I understand your objections now. You aren't arguing that we are unsafe by default, only that we are unsafe with servers that do something special beyond simply writing to a single disk or file. I agree this is problematic, but you simply don't use this feature if your server can't deal with it well.

Not quite. I am arguing that:

1. Parallel channels as currently implemented are inherently unsafe on some servers - I have given an example of such servers

2. Parallel channels as currently implemented may be unsafe on the reference server on some or all platforms (depending on the behaviour of fdatasync(), which might vary between platforms)

3. Either the parallel channels stuff should issue flush requests on all channels, or the protocol should protect the unwitting user by negotiating an option to do so (or refusing multi-channel connects). It is not reasonable for an nbd client to have to know the intimate details of the server and its implementation of synchronisation primitives - saying 'the user should disable multiple channels' is not good enough, as this is what we have a protocol for.

"unsafe" here means 'do not conform to the kernel's requirements for flush semantics, whereas they did without multiple channels'

So from my point of view either we need to have an additional connection flag ("don't use multiple channels unless you are going to issue a flush on all channels") OR flush on all channels should be the default.
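
As a rough sketch of the 'flush on all channels' option (hypothetical
types and helpers, not actual driver code), the fan-out I have in mind
looks something like the following: the single flush seen by the block
layer completes only once every channel has acknowledged its own
NBD_CMD_FLUSH.

#include <stdbool.h>
#include <stdio.h>

#define NR_CHANNELS 4

struct channel { int id; };

/* Hypothetical per-connection send/wait helpers. */
static bool send_flush(struct channel *c)
{
	printf("NBD_CMD_FLUSH -> channel %d\n", c->id);
	return true;
}

static bool wait_flush_reply(struct channel *c)
{
	printf("flush reply  <- channel %d\n", c->id);
	return true;
}

/* Complete the block-layer flush only when every channel has flushed,
 * so writes completed on any channel before the flush are covered. */
static bool flush_all_channels(struct channel *chs, int n)
{
	for (int i = 0; i < n; i++)
		if (!send_flush(&chs[i]))
			return false;
	for (int i = 0; i < n; i++)
		if (!wait_flush_reply(&chs[i]))
			return false;
	return true;
}

int main(void)
{
	struct channel chs[NR_CHANNELS] = { {0}, {1}, {2}, {3} };
	return flush_all_channels(chs, NR_CHANNELS) ? 0 : 1;
}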

Either way, SOME more work needs to be done on your patch, because it currently doesn't issue flush on all channels, and it currently does not have any way to prevent use of multiple channels.


I'd really like someone with linux / POSIX knowledge to confirm the linux and POSIX position on fdatasync of an fd opened by one process on writes made to the same file by another process. If on all POSIX platforms and all filing systems this is guaranteed to flush the second process's writes, then I think we could argue "OK there may be some weird servers which might not support multiple channels and they just need a way of signalling that". If on the other hand there is no such cross platform guarantee, I think this means in essence even with the reference server, this patch is unsafe, and it needs adapting to send flushes on all channels - yes it might theoretically be possible to introduce IPC to the current server, but you'd still need some way of tying together channels from a single client.

--
Alex Bligh





2016-10-03 21:07:55

by Wouter Verhelst

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

Alex,
Christoph,

On Mon, Oct 03, 2016 at 12:34:33PM +0100, Alex Bligh wrote:
> On 3 Oct 2016, at 08:57, Christoph Hellwig <[email protected]> wrote:
> >> Can you clarify what you mean by that? Why is it an "odd flush
> >> definition", and how would you "properly" define it?
> >
> > E.g. take the definition from NVMe which also supports multiple queues:
> >
> > "The Flush command shall commit data and metadata associated with the
> > specified namespace(s) to non-volatile media. The flush applies to all
> > commands completed prior to the submission of the Flush command.

Oh, prior to submission? Okay, that means I must have misunderstood what
you meant before; in that case there shouldn't be a problem.

> > The controller may also flush additional data and/or metadata from any
> > namespace."
> >
> > The focus is completed - we need to get a reply to the host first
> > before we can send the flush command, so anything that we require
> > to be flushed needs to explicitly be completed first.
>
> I think there are two separate issues here:
>
> a) What's described as the "HOL blocking issue".
>
> This comes down to what Wouter said here:
>
> > Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
> > when a FLUSH is sent over one channel, and the reply comes in, that all
> > commands which have been received, regardless of which channel they were
> > received over, have reached disk.
> >
> > [1] Message-ID: <[email protected]>
> >
> > It is impossible for nbd to make such a guarantee, due to head-of-line
> > blocking on TCP.
>
> this is perfectly accurate as far as it goes, but this isn't the current
> NBD definition of 'flush'.

I didn't read it that way.

> That is (from the docs):
>
> > All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM)
> > that the server completes (i.e. replies to) prior to processing to a
> > NBD_CMD_FLUSH MUST be written to non-volatile storage prior to
> > replying to that NBD_CMD_FLUSH.

This is somewhat ambiguous, in that (IMO) it doesn't clearly state the
point where the cutoff of "may not be on disk yet" is. What is
"processing"? We don't define that, and therefore it could be any point
between "receipt of the request message" and "sending the reply
message". I had interpreted it closer to the latter than was apparently
intended, but that isn't very useful; I see now that it should be closer
to the former; a more useful definition is probably something along the
following lines:

All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
for which a reply was received on the client side prior to the
transmission of the NBD_CMD_FLUSH message MUST be written to
non-volatile storage prior to replying to that NBD_CMD_FLUSH. A
server MAY process this command in ways that result in committing more
data to non-volatile storage than is strictly required.

[...]
> I don't think there is actually a problem here - Wouter if I'm wrong
> about this, I'd like to understand your argument better.

No, I now see that there isn't, and I misunderstood things. However, I
do think we should update the spec to clarify this.

> b) What I'm describing - which is the lack of synchronisation between
> channels.
[... long explanation snipped...]

Yes, and I acknowledge that. However, I think that should not be a
blocker. It's fine to mark this feature as experimental; it will not
ever be required to use multiple connections to connect to a server.

When this feature lands in nbd-client, I plan to ensure that the man
page and -help output says something along the following lines:

use N connections to connect to the NBD server, improving performance
at the cost of a possible loss of reliability.

The interactions between multiple connections and the NBD_CMD_FLUSH
command, especially when the actual storage and the NBD server are not
physically on the same machine, are not currently well defined and not
completely understood, and therefore the use of multiple connections to
the same server could theoretically lead to data corruption and/or
loss. Use with caution.

This probably needs some better wording, but you get the idea.

(also, this will interact really really badly with the reference
implementation's idea of copy-on-write, I suppose, since that implements
COW on a per-socket basis with a per-IP diff file...)

Anyway, the point is that this is a feature which may still cause
problems in a number of edge cases and which should therefore not be the
default yet, but which can be useful in a number of common situations
for which NBD is used today.

[...]
> Now, in the reference server, NBD_CMD_FLUSH is implemented through an
> fdatasync().

Actually, no, the reference server uses fsync() for reasons that I've
forgotten (side note: you wrote it that way ;-)

[...]

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
people in the world who think they really understand all of its rules,
and pretty much all of them are just lying to themselves too.
-- #debian-devel, OFTC, 2016-02-12

2016-10-04 09:38:24

by Alex Bligh

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

Wouter,

>>> It is impossible for nbd to make such a guarantee, due to head-of-line
>>> blocking on TCP.
>>
>> this is perfectly accurate as far as it goes, but this isn't the current
>> NBD definition of 'flush'.
>
> I didn't read it that way.
>
>> That is (from the docs):
>>
>>> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM)
>>> that the server completes (i.e. replies to) prior to processing to a
>>> NBD_CMD_FLUSH MUST be written to non-volatile storage prior to
>>> replying to that NBD_CMD_FLUSH.
>
> This is somewhat ambiguous, in that (IMO) it doesn't clearly state the
> point where the cutoff of "may not be on disk yet" is. What is
> "processing"?

OK. I now get the problem. There are actually two types of HOL blocking,
server to client and client to server.

Previously, we allowed the server to issue replies out of order with requests.
However, the protocol did guarantee that the server saw requests in the
order presented by clients. With the proposed multi-connection support,
this changes. Whilst the client needed to be prepared for things to be
disordered by the server, the server did not previously need to be
prepared for things being disordered by the client. And (more subtly)
the client could assume that the server got its own requests in the
order it sent them, which is important for flush as it is written at the
moment.


Here's an actual illustration of the problem:

Currently we have:

Client                          Server
======                          ======
TX: WRITE
TX: FLUSH
                                RX: WRITE
                                RX: FLUSH
                                Process write
                                Process flush including write
                                TX: write reply
                                TX: flush reply
RX: write reply
RX: flush reply


Currently the RX statements cannot be disordered. However the
server can process the requests in a different order. If it
does, the flush need not include the write, like this:

Client                          Server
======                          ======
TX: WRITE
TX: FLUSH
                                RX: WRITE
                                RX: FLUSH
                                Process flush not including write
                                Process write
                                TX: flush reply
                                TX: write reply
RX: flush reply
RX: write reply

and the client gets to know of the fact, because the flush
reply comes before the write reply. It can know its data has
not been flushed. It could send another flush in this case, or
simply change its code to not send the flush until the write
has been received.

However, with the multi-connection support, both the replies
and the requests can be disordered. So the client can ONLY
know a flush has been completed if it has received a reply
to the write before it sends the flush.

This is in my opinion problematic, as what you want to do as
a client is stream requests (write, write, write, flush, write,
write, write). If those go down different channels, AND you
don't wait for a reply, you can no longer safely stream requests
at all. Now you need to wait for the flush request to respond
before sending another write (if write ordering to the platter
is important), which seems to defeat the object of streaming
commands.

An 'in extremis' example would be a sequence of write / flush
requests sent down two channels, where the write requests all
end up on one channel, and the flush requests on the other,
and the write channel is serviced immediately and the flush
requests delayed indefinitely.
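
To make that concrete, a toy model of a client that simply round-robins
requests across two channels (purely illustrative, not the proposed
implementation) shows the writes and flushes separating onto different
sockets:

#include <stdio.h>

#define NR_CHANNELS 2

/* Naive distribution: request i goes to channel i % NR_CHANNELS. */
int main(void)
{
	const char *reqs[] = { "WRITE", "FLUSH", "WRITE", "FLUSH" };
	int n = sizeof(reqs) / sizeof(reqs[0]);

	for (int i = 0; i < n; i++)
		printf("%-5s -> channel %d\n", reqs[i], i % NR_CHANNELS);

	/* Every WRITE lands on channel 0 and every FLUSH on channel 1,
	 * so a stalled channel 1 delays only the flushes. */
	return 0;
}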

> We don't define that, and therefore it could be any point
> between "receipt of the request message" and "sending the reply
> message". I had interpreted it closer to the latter than was apparently
> intended, but that isn't very useful;

The thing is the server doesn't know what replies the client has
received, only the replies it has sent. Equally the server doesn't
know what commands the client has sent, only what commands it has
received.

As currently written, it's a simple rule, NBD_CMD_FLUSH means
"Mr Server: you must make sure that any write you have sent a
reply to must now be persisted on disk. If you haven't yet sent
a reply to a write - perhaps because due to HOL blocking you
haven't received it, or perhaps it's still in progress, or perhaps
it's finished but you haven't sent the reply - don't worry".

The promise to the client is that all the writes to which the
server has sent a reply are now on disk. But the client doesn't
know which writes the server has sent a reply to. It only knows
which replies it has received (which will be a subset of those). So
to the client it means that the server has persisted to disk all
those commands to which it has received a reply. However, to the
server, the 'MUST' condition needs to refer to the replies it sent,
not the replies the client receives.

I think "before processing" here really just means "before sending
a reply to the NBD_CMD_FLUSH". I believe the 'before processing'
phrase came from the kernel wording.

> I see now that it should be closer
> to the former; a more useful definition is probably something along the
> following lines:
>
> All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
> for which a reply was received on the client side prior to the

No, that's wrong as the server has no knowledge of whether the client
has actually received them so no way of knowing to which writes that
would reply. It thus has to ensure it covers a potentially larger
set of writes - those for which a reply has been sent, as all those
MIGHT have been received by the client.

> transmission of the NBD_CMD_FLUSH message MUST be written to

no that's wrong because the server has no idea when the client
transmitted the NBD_CMD_FLUSH message. It can only deal with
events it knows about. And the delimiter is (in essence) those
write commands that were replied to prior to the sending of the
reply to the NBD_CMD_FLUSH - of course it can flush others too.

> non-volatile storage prior to replying to that NBD_CMD_FLUSH. A
> server MAY process this command in ways that result in committing more
> data to non-volatile storage than is strictly required.

I think the wording is basically right for the current semantic,
but here's a slight improvement:

The server MUST NOT send a non-error reply to NBD_CMD_FLUSH
until it has ensured that the contents of all writes which it
has already completed (i.e. replied to) have been persisted
to non-volatile storage.

However, given that the replies can subsequently be reordered, I
now think we do have a problem, as the client can't tell which
writes those refer to.

> [...]
>> I don't think there is actually a problem here - Wouter if I'm wrong
>> about this, I'd like to understand your argument better.
>
> No, I now see that there isn't, and I misunderstood things. However, I
> do think we should update the spec to clarify this.

Haha - I now think there is. You accidentally convinced me!

>> b) What I'm describing - which is the lack of synchronisation between
>> channels.
> [... long explanation snipped...]
>
> Yes, and I acknowledge that. However, I think that should not be a
> blocker. It's fine to mark this feature as experimental; it will not
> ever be required to use multiple connections to connect to a server.
>
> When this feature lands in nbd-client, I plan to ensure that the man
> page and -help output says something along the following lines:
>
> use N connections to connect to the NBD server, improving performance
> at the cost of a possible loss of reliability.

So in essence we are relying on (userspace) nbd-client not to open
more connections if it's unsafe? IE we can sort out all the negotiation
of whether it's safe or unsafe within userspace and not bother Josef
about it? I suppose that's fine in that we can at least shorten
the CC: line, but I still think it would be helpful if the protocol

>> Now, in the reference server, NBD_CMD_FLUSH is implemented through an
>> fdatasync().
>
> Actually, no, the reference server uses fsync() for reasons that I've
> forgotten (side note: you wrote it that way ;-)
>
> [...]

I vaguely remember why - something to do with files expanding when
holes were written to. However, I don't think that makes much
difference to the question I asked, or at most s/fdatasync/fsync/g

--
Alex Bligh




2016-10-06 09:05:01

by Wouter Verhelst

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

Hi Alex,

On Tue, Oct 04, 2016 at 10:35:03AM +0100, Alex Bligh wrote:
> Wouter,
> > I see now that it should be closer
> > to the former; a more useful definition is probably something along the
> > following lines:
> >
> > All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
> > for which a reply was received on the client side prior to the
>
> No, that's wrong as the server has no knowledge of whether the client
> has actually received them so no way of knowing to which writes that
> would reply.

I realise that, but I don't think it's a problem.

In the current situation, a client could opportunistically send a number
of write requests immediately followed by a flush and hope for the best.
However, in that case there is no guarantee that, for the write requests
that the client actually cares about having hit the disk, a reply
arrives on the client side before the flush reply arrives. If that
doesn't happen, the client would have to issue another flush request,
probably at a performance hit.

As I understand Christoph's explanations, currently the Linux kernel
*doesn't* issue flush requests unless and until the necessary writes
have already completed (i.e., the reply has been received and processed
on the client side). Given that, given the issue in the previous
paragraph, and given the uncertainty introduced with multiple
connections, I think it is reasonable to say that a client should just
not assume a flush touches anything except for the writes for which it
has already received a reply by the time the flush request is sent out.

Those are semantics that are actually useful and can be guaranteed in
the face of multiple connections. Other semantics can not.

It is indeed impossible for a server to know what has been received by
the client by the time it (the client) sent out the flush request.
However, the server doesn't need that information, at all. The flush
request's semantics do not say that any request not covered by the flush
request itself MUST NOT have hit disk; instead, it just says that there
is no guarantee on whether or not that is the case. That's fine; all a
server needs to know is that when it receives a flush, it needs to
fsync() or some such, and then send the reply. All a *client* needs to
know is which requests have most definitely hit the disk. In my
proposal, those are the requests that finished before the flush request
was sent, and not the requests that finished between that and when the
flush reply is received. Those are *likely* to also be covered
(especially on single-connection NBD setups), but in my proposal,
they're no longer *guaranteed* to be.
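
As a purely illustrative sketch of that rule from the client's side
(made-up names and a simple event counter, nothing resembling actual
driver code): when the flush reply arrives, only writes whose replies
had already arrived when the flush was sent may be treated as durable.

#include <stdio.h>

struct write_state {
	int  id;
	long completed_at;	/* event count when the reply arrived, -1 if none */
};

static long now;		/* monotonically increasing event counter */

static void write_reply_received(struct write_state *w)
{
	w->completed_at = ++now;
}

int main(void)
{
	struct write_state w[2] = { { 1, -1 }, { 2, -1 } };

	write_reply_received(&w[0]);	/* reply to write 1 arrives        */
	long flush_sent_at = ++now;	/* client transmits NBD_CMD_FLUSH  */
	write_reply_received(&w[1]);	/* reply to write 2 arrives later  */
	++now;				/* flush reply arrives             */

	/* Only write 1 is guaranteed durable; write 2 merely might be. */
	for (int i = 0; i < 2; i++)
		printf("write %d %s guaranteed durable\n", w[i].id,
		       w[i].completed_at != -1 && w[i].completed_at < flush_sent_at
			   ? "is" : "is NOT");
	return 0;
}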

Christoph: just to double-check: would such semantics be incompatible
with the semantics that the Linux kernel expects of block devices? If
so, we'll have to review. Otherwise, I think we should go with that.

[...]
> >> b) What I'm describing - which is the lack of synchronisation between
> >> channels.
> > [... long explanation snipped...]
> >
> > Yes, and I acknowledge that. However, I think that should not be a
> > blocker. It's fine to mark this feature as experimental; it will not
> > ever be required to use multiple connections to connect to a server.
> >
> > When this feature lands in nbd-client, I plan to ensure that the man
> > page and -help output says something along the following lines:
> >
> > use N connections to connect to the NBD server, improving performance
> > at the cost of a possible loss of reliability.
>
> So in essence we are relying on (userspace) nbd-client not to open
> more connections if it's unsafe? IE we can sort out all the negotiation
> of whether it's safe or unsafe within userspace and not bother Josef
> about it?

Yes, exactly.

> I suppose that's fine in that we can at least shorten the CC: line,
> but I still think it would be helpful if the protocol

unfinished sentence here...

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
people in the world who think they really understand all of its rules,
and pretty much all of them are just lying to themselves too.
-- #debian-devel, OFTC, 2016-02-12

2016-10-06 09:42:01

by Alex Bligh

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

Wouter,

> On 6 Oct 2016, at 10:04, Wouter Verhelst <[email protected]> wrote:
>
> Hi Alex,
>
> On Tue, Oct 04, 2016 at 10:35:03AM +0100, Alex Bligh wrote:
>> Wouter,
>>> I see now that it should be closer
>>> to the former; a more useful definition is probably something along the
>>> following lines:
>>>
>>> All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
>>> for which a reply was received on the client side prior to the
>>
>> No, that's wrong as the server has no knowledge of whether the client
>> has actually received them so no way of knowing to which writes that
>> would reply.
>
> I realise that, but I don't think it's a problem.
>
> In the current situation, a client could opportunistically send a number
> of write requests immediately followed by a flush and hope for the best.
> However, in that case there is no guarantee that, for the write requests
> that the client actually cares about having hit the disk, a reply
> arrives on the client side before the flush reply arrives. If that
> doesn't happen, the client would have to issue another flush request,
> probably at a performance hit.

Sure, but the client knows (currently) that any write request which
it has a reply to before it receives the reply from the flush request
has been written to disk. Such a client might simply note whether it
has issued any subsequent write requests.

> As I understand Christoph's explanations, currently the Linux kernel
> *doesn't* issue flush requests unless and until the necessary writes
> have already completed (i.e., the reply has been received and processed
> on the client side).

Sure, but it is not the only client.

> Given that, given the issue in the previous
> paragraph, and given the uncertainty introduced with multiple
> connections, I think it is reasonable to say that a client should just
> not assume a flush touches anything except for the writes for which it
> has already received a reply by the time the flush request is sent out.

OK. So you are proposing weakening the semantic for flush (saying that
it is only guaranteed to cover those writes for which the client has
actually received a reply prior to sending the flush, as opposed to
prior to receiving the flush reply). This is based on the view that
the Linux kernel client wouldn't be affected, and if other clients
were affected, their behaviour would be 'somewhat unusual'.

We do have one significant other client out there that uses flush
which is Qemu. I think we should get a view on whether they would be
affected.

> Those are semantics that are actually useful and can be guaranteed in
> the face of multiple connections. Other semantics can not.

Well there is another semantic which would work just fine, and also
cures the other problem (synchronisation between channels) which would
be simply that flush is only guaranteed to affect writes issued on the
same channel. Then flush would do the natural thing, i.e. flush
all the writes that had been done *on that channel*.

> It is indeed impossible for a server to know what has been received by
> the client by the time it (the client) sent out the flush request.
> However, the server doesn't need that information, at all. The flush
> request's semantics do not say that any request not covered by the flush
> request itself MUST NOT have hit disk; instead, it just says that there
> is no guarantee on whether or not that is the case. That's fine; all a
> server needs to know is that when it receives a flush, it needs to
> fsync() or some such, and then send the reply. All a *client* needs to
> know is which requests have most definitely hit the disk. In my
> proposal, those are the requests that finished before the flush request
> was sent, and not the requests that finished between that and when the
> flush reply is received. Those are *likely* to also be covered
> (especially on single-connection NBD setups), but in my proposal,
> they're no longer *guaranteed* to be.

I think my objection was more that you were writing mandatory language
for a server's behaviour based on what the client perceives.

What you are saying, from the client's point of view, is that under
your proposed change it can only rely on writes in respect of
which the reply has been received prior to issuing the flush being persisted
to disk (more might be persisted, but the client can't rely on it).

So far so good.

However, I don't think you can usefully make the guarantee weaker from the
SERVER'S point of view, because it doesn't know how things got reordered.
IE it still needs to persist to disk any write that it has completed
when it processes the flush. Yes, the client doesn't get the same guarantee,
but the server can't know whether it can be slacker about a particular write
which it has performed but for which the client didn't receive the reply
prior to issuing the flush - it must just assume that if it sent the reply
(or even queued it to be sent) before processing the flush, then the reply
MIGHT have arrived at the client prior to the flush being issued.

IE I don't actually think the wording from the server side needs changing
now I see what you are trying to do. We just need a new paragraph saying
what the client can and cannot rely on.

> Christoph: just to double-check: would such semantics be incompatible
> with the semantics that the Linux kernel expects of block devices? If
> so, we'll have to review. Otherwise, I think we should go with that.

It would also really be nice to know whether there is any way the
flushes could be linked to the channel(s) containing the writes to which
they belong - this would solve the issues with coherency between channels.

Equally no one has answered the question as to whether fsync/fdatasync
is guaranteed (especially when not on Linux, not on a block FS) to give
synchronisation when different processes have different FDs open on the
same file. Is there some way to detect when this is safe?

>
> [...]
>>>> b) What I'm describing - which is the lack of synchronisation between
>>>> channels.
>>> [... long explanation snipped...]
>>>
>>> Yes, and I acknowledge that. However, I think that should not be a
>>> blocker. It's fine to mark this feature as experimental; it will not
>>> ever be required to use multiple connections to connect to a server.
>>>
>>> When this feature lands in nbd-client, I plan to ensure that the man
>>> page and -help output says something along the following lines:
>>>
>>> use N connections to connect to the NBD server, improving performance
>>> at the cost of a possible loss of reliability.
>>
>> So in essence we are relying on (userspace) nbd-client not to open
>> more connections if it's unsafe? IE we can sort out all the negotiation
>> of whether it's safe or unsafe within userspace and not bother Josef
>> about it?
>
> Yes, exactly.
>
>> I suppose that's fine in that we can at least shorten the CC: line,
>> but I still think it would be helpful if the protocol
>
> unfinished sentence here...

... but I still think it would be helpful if the protocol helped out
the end user of the client and refused to negotiate multichannel
connections when they are unsafe. How is the end client meant to know
whether the back end is not on Linux, not on a block device, done
via a Ceph driver etc?

I still think it's pretty damn awkward that with a ceph back end
(for instance) which would be one of the backends to benefit the
most from multichannel connections (as it's inherently parallel),
no one has explained how flush could be done safely.

--
Alex Bligh




2016-10-06 10:15:54

by Wouter Verhelst

Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Thu, Oct 06, 2016 at 10:41:36AM +0100, Alex Bligh wrote:
> Wouter,
[...]
> > Given that, given the issue in the previous
> > paragraph, and given the uncertainty introduced with multiple
> > connections, I think it is reasonable to say that a client should just
> > not assume a flush touches anything except for the writes for which it
> > has already received a reply by the time the flush request is sent out.
>
> OK. So you are proposing weakening the semantic for flush (saying that
> it is only guaranteed to cover those writes for which the client has
> actually received a reply prior to sending the flush, as opposed to
> prior to receiving the flush reply). This is based on the view that
> the Linux kernel client wouldn't be affected, and if other clients
> were affected, their behaviour would be 'somewhat unusual'.

Right.

> We do have one significant other client out there that uses flush
> which is Qemu. I think we should get a view on whether they would be
> affected.

That's certainly something to consider, yes.

> > Those are semantics that are actually useful and can be guaranteed in
> > the face of multiple connections. Other semantics can not.
>
> Well there is another semantic which would work just fine, and also
> cures the other problem (synchronisation between channels) which would
> be simply that flush is only guaranteed to affect writes issued on the
> same channel. Then flush would do the natural thing, i.e. flush
> all the writes that had been done *on that channel*.

That is an option, yes, but the natural result will be that you issue N
flush requests, rather than one, which I'm guessing will kill
performance. Therefore, I'd prefer not to go down that route.

[...]
> > It is indeed impossible for a server to know what has been received by
> > the client by the time it (the client) sent out the flush request.
> > However, the server doesn't need that information, at all. The flush
> > request's semantics do not say that any request not covered by the flush
> > request itself MUST NOT have hit disk; instead, it just says that there
> > is no guarantee on whether or not that is the case. That's fine; all a
> > server needs to know is that when it receives a flush, it needs to
> > fsync() or some such, and then send the reply. All a *client* needs to
> > know is which requests have most definitely hit the disk. In my
> > proposal, those are the requests that finished before the flush request
> > was sent, and not the requests that finished between that and when the
> > flush reply is received. Those are *likely* to also be covered
> > (especially on single-connection NBD setups), but in my proposal,
> > they're no longer *guaranteed* to be.
>
> I think my objection was more that you were writing mandatory language
> for a server's behaviour based on what the client perceives.
>
> What you are saying, from the client's point of view, is that under
> your proposed change it can only rely on writes in respect of
> which the reply has been received prior to issuing the flush being persisted
> to disk (more might be persisted, but the client can't rely on it).

Exactly.

[...]
> IE I don't actually think the wording from the server side needs changing
> now I see what you are trying to do. We just need a new paragraph saying
> what the client can and cannot rely on.

That's obviously also a valid option. I'm looking forward to your
proposed wording then :-)

[...]
> >> I suppose that's fine in that we can at least shorten the CC: line,
> >> but I still think it would be helpful if the protocol
> >
> > unfinished sentence here...
>
> .... but I still think it would be helpful if the protocol helped out
> the end user of the client and refused to negotiate multichannel
> connections when they are unsafe. How is the end client meant to know
> whether the back end is not on Linux, not on a block device, done
> via a Ceph driver etc?

Well, it isn't. The server, if it provides certain functionality, should
also provide particular guarantees. If it can't provide those
guarantees, it should not provide that functionality.

e.g., if a server runs on a backend with cache coherency issues, it
should not allow multiple connections to the same device, etc.

> I still think it's pretty damn awkward that with a ceph back end
> (for instance) which would be one of the backends to benefit the
> most from multichannel connections (as it's inherently parallel),
> no one has explained how flush could be done safely.

If ceph doesn't have any way to guarantee that a write is available to
all readers of a particular device, then it *cannot* be used to map
block device semantics with multiple channels. Therefore, it should not
allow writing to the device from multiple clients, period, unless the
filesystem (or other thing) making use of the nbd device above the ceph
layer actually understands how things may go wrong and can take care of
it.

As such, I don't think that the problems inherent in using multiple
connections to a ceph device (which I do not deny) have any place in a
discussion on how NBD should work in the face of multiple channels with
a sane/regular backend.

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
people in the world who think they really understand all of its rules,
and pretty much all of them are just lying to themselves too.
-- #debian-devel, OFTC, 2016-02-12

2016-10-06 10:32:03

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Thu, Oct 06, 2016 at 11:04:15AM +0200, Wouter Verhelst wrote:
> In the current situation, a client could opportunistically send a number
> of write requests immediately followed by a flush and hope for the best.
> However, in that case there is no guarantee that the write requests the
> client actually cares about will have been replied to on the client side
> before the flush reply arrives. If that doesn't happen, the client would
> have to issue
> another flush request, probably at a performance hit.

There is also no guarantee that the server would receive them in order.

Note that people looked into schemes like this multiple times using
a SCSI feature called ordered tags which should provide this sort
of ordering, but no one managed to make it work reliably.

> As I understand Christoph's explanations, currently the Linux kernel
> *doesn't* issue flush requests unless and until the necessary writes
> have already completed (i.e., the reply has been received and processed
> on the client side). Given that, given the issue in the previous
> paragraph, and given the uncertainty introduced with multiple
> connections, I think it is reasonable to say that a client should just
> not assume a flush touches anything except for the writes for which it
> has already received a reply by the time the flush request is sent out.

Exactly. That's the wording in other protocol specifications, and the
semantics Linux (and Windows) rely on.

> Christoph: just to double-check: would such semantics be incompatible
> with the semantics that the Linux kernel expects of block devices? If
> so, we'll have to review. Otherwise, I think we should go with that.

No, they match the cache flush semantics in every other storage protocol
known to me, and they match the expectations of both the Linux kernel
and any other OS or consumer I know about perfectly.

2016-10-06 11:05:11

by Alex Bligh

[permalink] [raw]
Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support


> On 6 Oct 2016, at 11:15, Wouter Verhelst <[email protected]> wrote:
>
>>
>>
>> .... but I still think it would be helpful if the protocol helped out
>> the end user of the client and refused to negotiate multichannel
>> connections when they are unsafe. How is the end client meant to know
>> whether the back end is not on Linux, not on a block device, done
>> via a Ceph driver etc?
>
> Well, it isn't. The server, if it provides certain functionality, should
> also provide particular guarantees. If it can't provide those
> guarantees, it should not provide that functionality.
>
> e.g., if a server runs on a backend with cache coherency issues, it
> should not allow multiple connections to the same device, etc.

Sure. I'm simply saying that the connection flags should say "I can't
support multiple connections to this device" (available at
NBD_OPT_INFO time) rather than erroring out. This is a userspace
protocol issue.
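
Roughly the sort of client-side check I have in mind, as a sketch only;
the flag name and bit position are invented for the example and are not
part of the protocol as it stands today:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NBD_FLAG_EXAMPLE_MULTI_CONN_OK (1 << 8)   /* hypothetical */

    static bool can_use_multiple_connections(uint16_t transmission_flags)
    {
            return (transmission_flags & NBD_FLAG_EXAMPLE_MULTI_CONN_OK) != 0;
    }

    int main(void)
    {
            uint16_t flags = 0;      /* as learned at NBD_OPT_INFO time */

            if (!can_use_multiple_connections(flags))
                    fprintf(stderr,
                            "server does not advertise multi-connection safety; using one connection\n");
            return 0;
    }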

>> I still think it's pretty damn awkward that with a ceph back end
>> (for instance) which would be one of the backends to benefit the
>> most from multichannel connections (as it's inherently parallel),
>> no one has explained how flush could be done safely.
>
> If ceph doesn't have any way to guarantee that a write is available to
> all readers of a particular device, then it *cannot* be used to map
> block device semantics with multiple channels.

Thinking about it, I believe Ceph actually may be able to do that;
it's just harder than a straightforward flush.

> Therefore, it should not
> allow writing to the device from multiple clients, period, unless the
> filesystem (or other thing) making use of the nbd device above the ceph
> layer actually understands how things may go wrong and can take care of
> it.
>
> As such, I don't think that the problems inherent in using multiple
> connections to a ceph device (which I do not deny) have any place in a
> discussion on how NBD should work in the face of multiple channels with
> a sane/regular backend.

On which note, I am still not convinced that fsync() provides such
semantics on all operating systems and on Linux on non-block devices.
I'm not sure all those backends are 'insane'! However, if the server
could signal lack of support for multiple connections (see above)
my concerns would be significantly reduced. Note this requires no
kernel change (as you pointed out).

--
Alex Bligh




2016-10-06 13:10:44

by Wouter Verhelst

[permalink] [raw]
Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Thu, Oct 06, 2016 at 03:31:55AM -0700, Christoph Hellwig wrote:
> No, they match the cache flush semantics in every other storage protocol
> known to me, and they match the expectations of both the Linux kernel
> and any other OS or consumer I know about perfectly.

Okay, I've updated the proto.md file then, to clarify that in the case
of multiple connections, a client MUST NOT send a flush request until it
has seen the replies to the write requests that it cares about. That
should be enough for now.
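
As a stand-alone illustration of that wording (the bookkeeping is
invented for the example; this is not nbd-client code), the client
simply holds back the flush until every write it wants covered has
been acknowledged:

    #include <stdio.h>

    static int unacked_writes;       /* writes we care about, not yet replied to */

    static void write_reply_received(void)
    {
            unacked_writes--;
    }

    static void try_send_flush(void)
    {
            if (unacked_writes > 0) {
                    printf("flush held back: %d writes still unacknowledged\n",
                           unacked_writes);
                    return;
            }
            printf("sending NBD_CMD_FLUSH\n");
    }

    int main(void)
    {
            unacked_writes = 2;
            try_send_flush();        /* not yet allowed */
            write_reply_received();
            write_reply_received();
            try_send_flush();        /* now allowed */
            return 0;
    }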

Thanks,

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
people in the world who think they really understand all of its rules,
and pretty much all of them are just lying to themselves too.
-- #debian-devel, OFTC, 2016-02-12

2016-10-06 13:16:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Thu, Oct 06, 2016 at 03:09:49PM +0200, Wouter Verhelst wrote:
> Okay, I've updated the proto.md file then, to clarify that in the case
> of multiple connections, a client MUST NOT send a flush request until it
> has seen the replies to the write requests that it cares about. That
> should be enough for now.

How do you guarantee that nothing has been reordered or even lost, even for
a single connection?

2016-10-06 13:56:22

by Wouter Verhelst

[permalink] [raw]
Subject: Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

On Thu, Oct 06, 2016 at 06:16:30AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 06, 2016 at 03:09:49PM +0200, Wouter Verhelst wrote:
> > Okay, I've updated the proto.md file then, to clarify that in the case
> > of multiple connections, a client MUST NOT send a flush request until it
> > has seen the replies to the write requests that it cares about. That
> > should be enough for now.
>
> How do you guarantee that nothing has been reordered or even lost, even for
> a single connection?

In the case of a single connection, we already stated that the flush
covers the write requests for which a reply has already been sent out by
the time the flush reply is sent out. On a single connection, there is
no way an implementation can comply with the old requirement but not the
new one.

We do not guarantee any ordering beyond that; and lost requests would be
a bug in the server.
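
For completeness, the server-side obligation on a single connection
amounts to something like the following sketch (not taken from any
existing server; error handling and the actual wire format are
omitted): make everything already replied to stable before sending the
flush reply.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void handle_flush(int export_fd)
    {
            /* All writes already replied to must be stable before we reply. */
            if (fsync(export_fd) != 0)
                    perror("fsync");
            /* ... then the NBD_CMD_FLUSH reply goes out on the socket ... */
            printf("flush reply sent\n");
    }

    int main(void)
    {
            int fd = open("/tmp/nbd-export-example", O_RDWR | O_CREAT, 0600);

            if (fd < 0) {
                    perror("open");
                    return EXIT_FAILURE;
            }
            handle_flush(fd);
            close(fd);
            return 0;
    }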

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
people in the world who think they really understand all of its rules,
and pretty much all of them are just lying to themselves too.
-- #debian-devel, OFTC, 2016-02-12