2018-09-13 06:32:07

by Chris Siebenmann

Subject: Correctly understanding Linux's close-to-open consistency

I'm trying to get my head around the officially proper way of
writing to NFS files (not just what works today, and what I think
is supposed to work, since I was misunderstanding things about that
recently).

Is it correct to say that when writing data to NFS files, the only
sequence of operations that Linux NFS clients officially support is
the following:

- all processes on all client machines close() the file
- one machine (a client or the fileserver) opens() the file, writes
to it, and close()s again
- processes on client machines can now open() the file again for
reading

Other sequences of operations may work in some particular kernel version
or under some circumstances, but are not guaranteed to work over kernel
version changes or in general.

In an official 'we guarantee that if you do this, things will work' sense,
how does taking NFS locks interact with this required sequence? Do NFS
locks make some part of it unnecessary, or does it remain necessary and
NFS locks are just there to let you coordinate who has a magic 'you can
write' token and you still officially need to close and open and so on?

Thanks in advance.

- cks


2018-09-15 21:39:39

by Jeff Layton

Subject: Re: Correctly understanding Linux's close-to-open consistency

On Wed, 2018-09-12 at 21:24 -0400, Chris Siebenmann wrote:
> I'm trying to get my head around the officially proper way of
> writing to NFS files (not just what works today, and what I think
> is supposed to work, since I was misunderstanding things about that
> recently).
>
> Is it correct to say that when writing data to NFS files, the only
> sequence of operations that Linux NFS clients officially support is
> the following:
>
> - all processes on all client machines close() the file
> - one machine (a client or the fileserver) opens() the file, writes
> to it, and close()s again
> - processes on client machines can now open() the file again for
> reading

No.

One can always call fsync() to force data to be flushed to avoid the
close of the write fd in this situation. That's really a more portable
solution anyway. A local filesystem may not flush data to disk on close
(for instance), so calling fsync will ensure you rely less on filesystem
implementation details.

The separate open by the reader just helps ensure that the file's
attributes are revalidated (so you can tell whether cached data you hold
is still valid).
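
To put that concretely, here is a minimal C sketch of the two sides
(error handling omitted; the function names are only for illustration):

	#include <fcntl.h>
	#include <unistd.h>

	/* Writer side: push cached writes to the server without closing. */
	static void write_and_flush(int fd, const void *buf, size_t len)
	{
		write(fd, buf, len);
		fsync(fd);	/* data is now on the server; fd can stay open */
	}

	/* Reader side: the fresh open() is what triggers revalidation of
	 * the file's attributes, so the client can tell whether any data
	 * it has cached is still valid. */
	static ssize_t open_and_read(const char *path, void *buf, size_t len)
	{
		int fd = open(path, O_RDONLY);
		ssize_t n = read(fd, buf, len);
		close(fd);
		return n;
	}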

> Other sequences of operations may work in some particular kernel version
> or under some circumstances, but are not guaranteed to work over kernel
> version changes or in general.
>

The NFS client (and the Linux kernel in general) will try to preserve as
much cached data as it can, but eventually it will end up being freed,
depending on the kernel's memory requirements. This is not behavior you
want to depend on, as an application developer.

> In an official 'we guarantee that if you do this, things will work' sense,
> how does taking NFS locks interact with this required sequence? Do NFS
> locks make some part of it unnecessary, or does it remain necessary and
> NFS locks are just there to let you coordinate who has a magic 'you can
> write' token and you still officially need to close and open and so on?
>

If you use file locking (flock() or POSIX locks), then we treat those as
cache coherency points as well. The client will write back cached data
to the server prior to releasing a lock, and revalidate attributes (and
thus the local cache) after acquiring one.

If you have an application that does concurrent access via NFS over
multiple machines, then you probably want to be using file locking to
serialize things across machines.
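
Roughly, in C (the function names and error handling are mine, purely
illustrative):

	#include <sys/file.h>
	#include <unistd.h>

	/* On the writing machine: cached data is written back to the
	 * server before the lock is released. */
	static void locked_write(int fd, const void *buf, size_t len)
	{
		flock(fd, LOCK_EX);	/* acquiring revalidates attributes */
		write(fd, buf, len);
		flock(fd, LOCK_UN);	/* releasing writes back cached data first */
	}

	/* On the reading machine: acquiring the lock revalidates the
	 * attributes (and thus the local cache) before the read. */
	static ssize_t locked_read(int fd, void *buf, size_t len)
	{
		flock(fd, LOCK_SH);
		ssize_t n = read(fd, buf, len);
		flock(fd, LOCK_UN);
		return n;
	}
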
--
Jeff Layton <[email protected]>

2018-09-16 00:31:05

by Chris Siebenmann

Subject: Re: Correctly understanding Linux's close-to-open consistency

> On Wed, 2018-09-12 at 21:24 -0400, Chris Siebenmann wrote:
> > Is it correct to say that when writing data to NFS files, the only
> > sequence of operations that Linux NFS clients officially support is
> > the following:
> >
> > - all processes on all client machines close() the file
> > - one machine (a client or the fileserver) opens() the file, writes
> > to it, and close()s again
> > - processes on client machines can now open() the file again for
> > reading
>
> No.
>
> One can always call fsync() to force data to be flushed to avoid the
> close of the write fd in this situation. That's really a more portable
> solution anyway. A local filesystem may not flush data to disk, on close
> (for instance) so calling fsync will ensure you rely less on filesystem
> implementation details.
>
> The separate open by the reader just helps ensure that the file's
> attributes are revalidated (so you can tell whether cached data you
> hold is still valid).

This bit about the separate open doesn't seem to be the case
currently, and people here have asserted that it's not true in
general. Specifically, under some conditions *not involving you
writing*, if you do not close() the file before another machine writes
to it and then open() it afterward, the kernel may retain cached data
that it is in a position to know (for sure) is invalid because it didn't
exist in the previous version of the file (as it was past the end of
file position).

Since failing to close() before another machine open()s puts you
outside this outline of close-to-open, this kernel behavior is not a
bug as such (or so it's been explained to me here). If you go outside
c-t-o, the kernel is free to do whatever it finds most convenient, and
what it found most convenient was to not bother invalidating some cached
page data even though it saw a GETATTR change.

It may be that I'm not fully understanding how you mean 'revalidated'
here. Is it that the kernel does not necessarily bother (re)checking
some internal things (such as cached pages) even when it has new GETATTR
results, until you do certain operations?

As far as the writer using fsync() instead of close(): under this
model, the writer must close() if there are ever going to be writers
on another machine and readers on its machine (including itself),
because otherwise it (and they) will be in the 'reader' position here,
and in violation of the outline, and so their client kernel is free to
do odd things. (This is a basic model that ignores how NFS locks might
interact with things.)

> If you use file locking (flock() or POSIX locks), then we treat
> those as cache coherency points as well. The client will write back
> cached data to the server prior to releasing a lock, and revalidate
> attributes (and thus the local cache) after acquiring one.

The client currently appears to do more than re-check attributes,
at least in one sense of 'revalidate'. In some cases, flock() will
cause the client to flush cached data that it would otherwise return and
apparently considered valid, even though GETATTR results from the server
didn't change. I'm curious if this is guaranteed behavior, or simply
'it works today'.

(If by 'revalidate attributes' you mean that the kernel internally
revalidates some cached data that it didn't bother revalidating before,
then that would match observed behavior. As an outside user of NFS,
I find this confusing terminology, though, as the kernel clearly has
new GETATTR results.)

Specifically, consider the sequence:

        client A                                fileserver
        open file read-write
        read through end of file
1       go idle, but don't close file
2                                               open file, append data, close, sync

3       remain idle until fstat() shows st_size has grown

4       optional: close and re-open file
5       optional: flock()

6       read from old EOF to new EOF

Today, if you leave out #5, at #6 client A will read some zero bytes
instead of actual file content (whether or not you did #4). If you
include #5, it will not (again whether or not you did #4).

Under my outline in my original email, client A is behaving outside
of close to open consistency because it has not closed the file before
the fileserver wrote to it and opened it afterward. At point #3, in some
sense the client clearly knows that file attributes have changed, because
fstat() results have changed (showing a new, larger file size among other
things), but because we went outside the guaranteed behavior the kernel
doesn't have to care completely; it retains a cached partial page at the
old end of file and returns this data to us at step #6 (if we skip #5).

The file attributes obtained from the NFS server don't change between
#3, #4, and #5, but if we do #5, today the kernel does something with
the cached partial page that causes it to return real data at #6. This
doesn't happen with just #4, but under my outlined rules that's acceptable
because we violated c-t-o by closing the file only after it had been
changed elsewhere and so the kernel isn't obliged to do the magic that
it does for #5.

(In fact it is possible to read zero bytes before #5 and read good data
afterward, including in a different program.)
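
For concreteness, a rough C sketch of the client A side of this sequence
(the path, buffer size, and polling are invented for the example, and
error handling is omitted):

	#include <sys/file.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[65536];
		struct stat st;

		int fd = open("/nfs/testfile", O_RDWR);	/* never closed */
		fstat(fd, &st);
		off_t old_eof = st.st_size;
		while (read(fd, buf, sizeof(buf)) > 0)
			;	/* read through the old end of file */

		/* #3: stay idle until fstat() shows the file has grown
		 * (the fileserver appends, closes, and syncs meanwhile) */
		do {
			sleep(1);
			fstat(fd, &st);
		} while (st.st_size <= old_eof);

		/* #5 (optional): flock(fd, LOCK_SH); flock(fd, LOCK_UN); */

		/* #6: read from the old EOF onward; today, without #5, this
		 * reads zeroes from the stale cached partial page at old_eof
		 * instead of the appended data */
		pread(fd, buf, sizeof(buf), old_eof);
		return 0;
	}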

- cks

2018-09-16 16:23:39

by Jeff Layton

Subject: Re: Correctly understanding Linux's close-to-open consistency

On Sat, 2018-09-15 at 15:11 -0400, Chris Siebenmann wrote:
> > On Wed, 2018-09-12 at 21:24 -0400, Chris Siebenmann wrote:
> > > Is it correct to say that when writing data to NFS files, the only
> > > sequence of operations that Linux NFS clients officially support is
> > > the following:
> > >
> > > - all processes on all client machines close() the file
> > > - one machine (a client or the fileserver) opens() the file, writes
> > > to it, and close()s again
> > > - processes on client machines can now open() the file again for
> > > reading
> >
> > No.
> >
> > One can always call fsync() to force data to be flushed to avoid the
> > close of the write fd in this situation. That's really a more portable
> > solution anyway. A local filesystem may not flush data to disk, on close
> > (for instance) so calling fsync will ensure you rely less on filesystem
> > implementation details.
> >
> > The separate open by the reader just helps ensure that the file's
> > attributes are revalidated (so you can tell whether cached data you
> > hold is still valid).
>
> This bit about the separate open doesn't seem to be the case
> currently, and people here have asserted that it's not true in
> general. Specifically, under some conditions *not involving you
> writing*, if you do not close() the file before another machine writes
> to it and then open() it afterward, the kernel may retain cached data
> that it is in a position to know (for sure) is invalid because it didn't
> exist in the previous version of the file (as it was past the end of
> file position).
>
> Since failing to close() before another machine open()s puts you
> outside this outline of close-to-open, this kernel behavior is not a
> bug as such (or so it's been explained to me here). If you go outside
> c-t-o, the kernel is free to do whatever it finds most convenient, and
> what it found most convenient was to not bother invalidating some cached
> page data even though it saw a GETATTR change.
>

That would be a bug. If we have reason to believe the file has changed,
then we must invalidate the cache on the file prior to allowing a read
to proceed.

> It may be that I'm not fully understanding how you mean 'revalidated'
> here. Is it that the kernel does not necessarily bother (re)checking
> some internal things (such as cached pages) even when it has new GETATTR
> results, until you do certain operations?
>

Well, it'll generally mark the cache as being invalid (e.g.
NFS_INO_INVALID_DATA flag). Whether it purges the cache at that point is
a different matter. If we have writes cached, then we can't just drop
pages that have dirty data. They must be written back to the server
first.

Basically, if you don't take steps to serialize your I/O between hosts,
then your results may not be what you expect.

> As far as the writer using fsync() instead of close(): under this
> model, the writer must close() if there are ever going to be writers
> on another machine and readers on its machine (including itself),
> because otherwise it (and they) will be in the 'reader' position here,
> and in violation of the outline, and so their client kernel is free to
> do odd things. (This is a basic model that ignores how NFS locks might
> interact with things.)
>

A close() on NFS is basically doing fsync() and then close(), unless you
hold a write delegation, in which case it may not do the fsync since
it's not required.

> > If you use file locking (flock() or POSIX locks), then we treat
> > those as cache coherency points as well. The client will write back
> > cached data to the server prior to releasing a lock, and revalidate
> > attributes (and thus the local cache) after acquiring one.
>
> The client currently appears to do more than re-check attributes,
> at least in one sense of 'revalidate'. In some cases, flock() will
> cause the client to flush cached data that it would otherwise return and
> apparently considered valid, even though GETATTR results from the server
> didn't change. I'm curious if this is guaranteed behavior, or simply
> 'it works today'.
>

You need to distinguish between two different cases in the cache here.
Pages can be dirty or clean. When I say flush here, I mean that it's
writing back dirty data.

The client can decide to drop clean pages at any time. It doesn't need a
reason -- being low on memory is good enough.


> (If by 'revalidate attributes' you mean that the kernel internally
> revalidates some cached data that it didn't bother revalidating before,
> then that would match observed behavior. As an outside user of NFS,
> I find this confusing terminology, though, as the kernel clearly has
> new GETATTR results.)
>
> Specifically, consider the sequence:
>
>         client A                                fileserver
>         open file read-write
>         read through end of file
> 1       go idle, but don't close file
> 2                                               open file, append data, close, sync
>
> 3       remain idle until fstat() shows st_size has grown
>
> 4       optional: close and re-open file
> 5       optional: flock()
>
> 6       read from old EOF to new EOF
>
> Today, if you leave out #5, at #6 client A will read some zero bytes
> instead of actual file content (whether or not you did #4). If you
> include #5, it will not (again whether or not you did #4).
>
> Under my outline in my original email, client A is behaving outside
> of close to open consistency because it has not closed the file before
> the fileserver wrote to it and opened it afterward. At point #3, in some
> sense the client clearly knows that file attributes have changed, because
> fstat() results have changed (showing a new, larger file size among other
> things), but because we went outside the guaranteed behavior the kernel
> doesn't have to care completely; it retains a cached partial page at the
> old end of file and returns this data to us at step #6 (if we skip #5).
>
> The file attributes obtained from the NFS server don't change between
> #3, #4, and #5, but if we do #5, today the kernel does something with
> the cached partial page that causes it to return real data at #6. This
> doesn't happen with just #4, but under my outlined rules that's acceptable
> because we violated c-t-o by closing the file only after it had been
> changed elsewhere and so the kernel isn't obliged to do the magic that
> it does for #5.
>
> (In fact it is possible to read zero bytes before #5 and read good data
> afterward, including in a different program.)
>
>

Sure. As I said before, locking acts as a cache coherency point. On
flock, we would revalidate the attributes, so it would see the new size
and do reads like you'd expect.

As complicated as CTO sounds, it's actually relatively simple. When we
close a file, we flush any cached write data back to the server
(basically doing an fsync). When we open a file, we revalidate the
attributes to ensure that we know whether the cache is valid. We do
similar things with locking (releasing a lock flushes cached data, and
acquiring one revalidates attributes).

The client however is free to flush data at any time and fetch
attributes at any time. YMMV if changes happened to the file after you
locked or opened it, or if someone performs reads prior to your unlock
or close. If you want consistent reads and writes then you _must_ ensure
that the accesses are serialized. Usually that's done with locking but
it doesn't have to be if you can serialize open/close/fsync via other
mechanisms.

Basically, your assertion was that you _must_ open and close files in
order to get proper cache coherency between clients doing reads and
writes. That's simply not true if you use file locking. If you've found
cases where file locks are not protecting things as they should then
please do raise a bug report.

It's also not required to close the file that was open for write if you
do an fsync prior to the reader reopening the file. The close is
completely extraneous at that point since you know that writeback is
complete. The reopen for read in that case is only required in order to
ensure that the attrs are re-fetched prior to trusting the reader's
cache.

--
Jeff Layton <[email protected]>

2018-09-16 21:36:12

by Trond Myklebust

Subject: Re: Correctly understanding Linux's close-to-open consistency

On Sun, 2018-09-16 at 07:01 -0400, Jeff Layton wrote:
> On Sat, 2018-09-15 at 15:11 -0400, Chris Siebenmann wrote:
> > > On Wed, 2018-09-12 at 21:24 -0400, Chris Siebenmann wrote:
> > > > Is it correct to say that when writing data to NFS files, the only
> > > > sequence of operations that Linux NFS clients officially support is
> > > > the following:
> > > >
> > > > - all processes on all client machines close() the file
> > > > - one machine (a client or the fileserver) opens() the file, writes
> > > >   to it, and close()s again
> > > > - processes on client machines can now open() the file again for
> > > >   reading
> > >
> > > No.
> > >
> > > One can always call fsync() to force data to be flushed to avoid the
> > > close of the write fd in this situation. That's really a more portable
> > > solution anyway. A local filesystem may not flush data to disk, on close
> > > (for instance) so calling fsync will ensure you rely less on filesystem
> > > implementation details.
> > >
> > > The separate open by the reader just helps ensure that the file's
> > > attributes are revalidated (so you can tell whether cached data you
> > > hold is still valid).
> >
> > This bit about the separate open doesn't seem to be the case
> > currently, and people here have asserted that it's not true in
> > general. Specifically, under some conditions *not involving you
> > writing*, if you do not close() the file before another machine writes
> > to it and then open() it afterward, the kernel may retain cached data
> > that it is in a position to know (for sure) is invalid because it didn't
> > exist in the previous version of the file (as it was past the end of
> > file position).
> >
> > Since failing to close() before another machine open()s puts you
> > outside this outline of close-to-open, this kernel behavior is not a
> > bug as such (or so it's been explained to me here). If you go outside
> > c-t-o, the kernel is free to do whatever it finds most convenient, and
> > what it found most convenient was to not bother invalidating some cached
> > page data even though it saw a GETATTR change.
> >
>
> That would be a bug. If we have reason to believe the file has changed,
> then we must invalidate the cache on the file prior to allowing a read
> to proceed.

The point here is that when the file is open for writing (or for
read+write), and your applications are not using locking, then we have
no reason to believe the file is being changed on the server, and we
deliberately optimise for the case where the cache consistency rules
are being observed.

If the file is open for reading only, then we may detect changes on the
server. However we certainly cannot guarantee that the data is
consistent due to the potential for write reordering as discussed
earlier in this thread, and due to the fact that attribute revalidation
is not atomic with reads.

Again, these are the cases where you are _not_ using locking to
mediate. If you are using locking, then I agree that changes need to be
seen by the client.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

2018-09-17 05:43:22

by Chris Siebenmann

Subject: Re: Correctly understanding Linux's close-to-open consistency

> > > Since failing to close() before another machine open()s puts you
> > > outside this outline of close-to-open, this kernel behavior is
> > > not a bug as such (or so it's been explained to me here). If you
> > > go outside c-t-o, the kernel is free to do whatever it finds most
> > > convenient, and what it found most convenient was to not bother
> > > invalidating some cached page data even though it saw a GETATTR
> > > change.
> >
> > That would be a bug. If we have reason to believe the file has
> > changed, then we must invalidate the cache on the file prior to
> > allowing a read to proceed.
>
> The point here is that when the file is open for writing (or for
> read+write), and your applications are not using locking, then we have
> no reason to believe the file is being changed on the server, and we
> deliberately optimise for the case where the cache consistency rules
> are being observed.

In this case the user level can be completely sure that the client
kernel has issued a GETATTR and received a different answer from the
NFS server, because the fstat() results it sees have changed from the
values it has seen before (and remembered). This may not count as the
NFS client kernel code '[having] reason to believe' that the file has
changed on the server from its perspective, but if so it's not because
the information is not available and a GETATTR would have to be explicitly
issued to find it out. The client code has made the GETATTR and received
different results, which it has passed to user level; it has just not
used those results to do things to its cached data.

Today, if you do a flock(), the NFS client code in the kernel will
do things that invalidate the cached data, despite the GETATTR result
from the fileserver not changing. From my outside perspective, as someone
writing code or dealing with programs that must work over NFS, this is a
little bit magical, and as a result I would like to understand if it is
guaranteed that the magic works or if this is not officially supported
magic, merely 'it happens to work' magic in the way that having the
file open read-write without the flock() used to work in kernel 4.4.x
but doesn't now (and this is simply considered to be the kernel using
CTO more strongly, not a bug).

(Looking at a tcpdump trace, the flock() call appears to cause the kernel
to issue another GETATTR to the fileserver. The results are the same as
the GETATTR results that were passed to the client program.)

> Again, these are the cases where you are _not_ using locking to
> mediate. If you are using locking, then I agree that changes need to
> be seen by the client.

The original code (Alpine) *is* using locking in the broad sense,
but it is not flock() locking; instead it is locking (in this case)
through .lock files. The current kernel behavior and what I've been
told about it implies that it is not sufficient for your application to
perfectly coordinate locking, writes, fsync(), and fstat() visibility
of the resulting changes through its own mechanism; you must do your
locking through the officially approved kernel channels (and it is not
clear what they are) or see potentially incorrect results.

Consider a system where reads and writes to a shared file are
coordinated by a central process that everyone communicates with through
TCP connections. The central process pauses readers before it allows
a writer to start, the writer always fsync()s before it releases its
write permissions, and then no reader is permitted to proceed until the
entire cluster sees the same updated fstat() result. This is perfectly
coordinated but currently could see incorrect read() results, and I've
been told that this is allowed under Linux's CTO rules because all of
the processes hold the file open read-write through this entire process
(and no one flock()s).

- cks

2018-09-17 07:44:20

by Trond Myklebust

Subject: Re: Correctly understanding Linux's close-to-open consistency

On Sun, 2018-09-16 at 20:18 -0400, Chris Siebenmann wrote:
> > > > Since failing to close() before another machine open()s puts you
> > > > outside this outline of close-to-open, this kernel behavior is
> > > > not a bug as such (or so it's been explained to me here). If you
> > > > go outside c-t-o, the kernel is free to do whatever it finds most
> > > > convenient, and what it found most convenient was to not bother
> > > > invalidating some cached page data even though it saw a GETATTR
> > > > change.
> > >
> > > That would be a bug. If we have reason to believe the file has
> > > changed, then we must invalidate the cache on the file prior to
> > > allowing a read to proceed.
> >
> > The point here is that when the file is open for writing (or for
> > read+write), and your applications are not using locking, then we have
> > no reason to believe the file is being changed on the server, and we
> > deliberately optimise for the case where the cache consistency rules
> > are being observed.
>
> In this case the user level can be completely sure that the client
> kernel has issued a GETATTR and received a different answer from the
> NFS server, because the fstat() results it sees have changed from the
> values it has seen before (and remembered). This may not count as the
> NFS client kernel code '[having] reason to believe' that the file has
> changed on the server from its perspective, but if so it's not because
> the information is not available and a GETATTR would have to be explicitly
> issued to find it out. The client code has made the GETATTR and received
> different results, which it has passed to user level; it has just not
> used those results to do things to its cached data.
>
> Today, if you do a flock(), the NFS client code in the kernel will
> do things that invalidate the cached data, despite the GETATTR result
> from the fileserver not changing. From my outside perspective, as someone
> writing code or dealing with programs that must work over NFS, this is a
> little bit magical, and as a result I would like to understand if it is
> guaranteed that the magic works or if this is not officially supported
> magic, merely 'it happens to work' magic in the way that having the
> file open read-write without the flock() used to work in kernel 4.4.x
> but doesn't now (and this is simply considered to be the kernel using
> CTO more strongly, not a bug).
>
> (Looking at a tcpdump trace, the flock() call appears to cause the kernel
> to issue another GETATTR to the fileserver. The results are the same as
> the GETATTR results that were passed to the client program.)

This is also documented in the NFS FAQ to which I pointed you earlier.

> > Again, these are the cases where you are _not_ using locking to
> > mediate. If you are using locking, then I agree that changes need to
> > be seen by the client.
>
> The original code (Alpine) *is* using locking in the broad sense,
> but it is not flock() locking; instead it is locking (in this case)
> through .lock files. The current kernel behavior and what I've been
> told about it implies that it is not sufficient for your application to
> perfectly coordinate locking, writes, fsync(), and fstat() visibility
> of the resulting changes through its own mechanism; you must do your
> locking through the officially approved kernel channels (and it is not
> clear what they are) or see potentially incorrect results.
>
> Consider a system where reads and writes to a shared file are
> coordinated by a central process that everyone communicates with through
> TCP connections. The central process pauses readers before it allows
> a writer to start, the writer always fsync()s before it releases its
> write permissions, and then no reader is permitted to proceed until the
> entire cluster sees the same updated fstat() result. This is perfectly
> coordinated but currently could see incorrect read() results, and I've
> been told that this is allowed under Linux's CTO rules because all of
> the processes hold the file open read-write through this entire process
> (and no one flock()s).
>

Why would such a system need to use buffered I/O instead of uncached
I/O (i.e. O_DIRECT)? What would be the point of optimising the buffered
I/O client for this use case rather than the close to open cache
consistent case?

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com