I'm currently experiencing an issue where reading from a file that is
known to only contain ascii characters can result in the process
reading that file seeing NULL bytes (0x0) in the file that never
existed. This happens for NFSv3 clients when the file is being
updated concurrently either by another NFSv3 client or directly on the
NFS server.
It's possible this issue is already known, as I've found the following
RHEL knowledge base article talking about it (unfortunately you need
to be logged in to view the full content of the following link):
https://access.redhat.com/solutions/22717
If this is already a known issue and there is relevant discussion of
it, I would appreciate it if someone could point me to it as I
couldn't find anything (but it's possible I just missed it).
If it isn't a known issue, here are the details of how I've been reproducing:
I have reproduced this on RHEL 6.5 (2.6.32-431.1.2.0.1.el6) as well as
on Fedora 21 (kernel-3.18.7-200.fc21).
In the RHEL 6.5 case, I am using two systems with the aforementioned
kernel and an Isilon as the NFS server. In the Fedora 21 case, I am
using just one system, acting as both client and server.
The basic idea is to touch something like 10 files, then start the
attached python program (tail.py) on them "tailing" them as they grow
and finally start updating them either from another NFS client or from
the server directly. Here is the general set of steps for reproducing
using a single system:
# mkdir -p /export/test
# mkdir -p /mnt/test
# echo '/export/test localhost(rw)' > /etc/exports
# exportfs -fa (make sure nfsd is running, etc)
# mount -t nfs -o vers=3 localhost:/export/test /mnt/test
(start the tailers)
# touch /mnt/test/file{00..09}
# for f in /mnt/test/file{00..09}; do python tail.py ${f} & done &&
cat >/dev/null
(start the updating)
# for f in /export/test/file{00..09}; do while true; do echo 'abc' >>
${f}; sleep .1; done & done
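For reference, here is a rough sketch of what tail.py does. The attachment itself is not reproduced in this message, so this is inferred from the behaviour described later in the thread (fixed-size reads, a re-open per read, re-stat'ing on EOF); treat names and constants as illustrative.
#!/usr/bin/env python
# Hypothetical stand-in for the attached tail.py: read fixed-size chunks,
# dump each chunk as hex, and on EOF poll stat() until the file grows.
import os
import sys
import time
CHUNK = 20  # matches the 20-byte hex dumps shown below
def tail(path):
    pos = 0
    while True:
        with open(path, 'rb') as f:   # re-open so close-to-open revalidation applies
            f.seek(pos)
            data = f.read(CHUNK)
        if data:
            pos += len(data)
            print('%s: %s' % (os.path.basename(path),
                              ['%02X' % b for b in bytearray(data)]))
        else:
            while os.stat(path).st_size <= pos:   # wait for the file to grow
                time.sleep(0.1)
if __name__ == '__main__':
    tail(sys.argv[1])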
After a little while, you'll see output like the following from the
"tailer" processes:
file08: ['61', '62', '63', '0A', '61', '62', '63', '0A', '61', '62',
'63', '0A', '61', '62', '63', '0A', '00', '00', '00', '00']
file04: ['61', '62', '63', '0A', '61', '62', '63', '0A', '61', '62',
'63', '0A', '61', '62', '63', '0A', '00', '00', '00', '00']
file02: ['61', '62', '63', '0A', '61', '62', '63', '0A', '61', '62',
'63', '0A', '61', '62', '63', '0A', '00', '00', '00', '00']
As you can see, there are NULL bytes at the end that never existed.
I've done some analysis of this using SystemTap and am happy to share
details if necessary. I believe the issue is due to concurrent
execution of `do_generic_file_read' (by the process making the read
syscall) and `nfs_update_inode' by rpciod. rpciod updates the inode
size, but before it can invalidate the page cache, our process making
the read syscall has already found the page and is now using the
vfs inode's size to determine how much data to copy out of the page
back into userspace.
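To make that interleaving concrete, here is a toy model of it in plain Python. This is illustrative only, not kernel code; the page contents and sizes are assumptions chosen to match the reproducer above.
PAGE_SIZE = 4096
# Cached page for offset 0: 16 valid bytes ("abc\n" * 4); the rest of the
# page is zero-filled, as pages are zeroed beyond the data actually read.
page = bytearray(b'abc\n' * 4) + bytearray(PAGE_SIZE - 16)
i_size = 16   # cached inode size at the time the page was filled
i_size = 20   # rpciod applies the new size before invalidating the page
# Reader, do_generic_file_read-style: copy up to i_size bytes from the page.
pos = 0
nr = min(i_size - pos, PAGE_SIZE)
print(['%02X' % b for b in bytearray(page[pos:pos + nr])])
# Output ends in '00', '00', '00', '00': four NULs that were never on the wire.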
It seems that since the VFS layer doesn't do any locking of the inode
(or at least `do_generic_file_read' doesn't), the fact that
rpciod is updating the inode under the protection of `i_lock' doesn't
help. My naive thought was to simply mark the page cache invalid
before updating the inode's size in `nfs_update_inode', but I'm unsure
if that would have other unwanted effects.
Any thoughts or insights appreciated.
On Mon, Feb 23, 2015 at 3:56 PM, Chris Perl <[email protected]> wrote:
> I'm currently experiencing an issue where reading from a file that is
> known to only contain ascii characters can result in the process
> reading that file seeing NULL bytes (0x0) in the file that never
> existed. This happens for NFSv3 clients when the file is being
> updated concurrently either by another NFSv3 client or directly on the
> NFS server.
That's expected behaviour, yes. NFS close-to-open cache consistency
semantics do not allow for concurrent access while the file is being
updated.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
> That's expected behaviour, yes. NFS close-to-open cache consistency
> semantics do not allow for concurrent access while the file is being
> updated.
Can you elaborate on that a bit?
My rough understanding of close-to-open cache consistency is basically
that a client's changes are not guaranteed to be on the server until a
close(2) and that a client is only guaranteed to check for updates on
open(2).
I can therefore see why one client reading while another is appending
could result in stale data being read from the cache (and indeed, this
is the reason the test program I provided continually opens and closes
the file, seeking as necessary). But I'm having a hard time seeing
why close-to-open cache consistency allows for returning data from the
cache that never existed in the file.
On Wed, Feb 25, 2015 at 12:04 PM, Chris Perl <[email protected]> wrote:
>> That's expected behaviour, yes. NFS close-to-open cache consistency
>> semantics do not allow for concurrent access while the file is being
>> updated.
>
> Can you elaborate on that a bit?
>
> My rough understanding of close-to-open cache consistency is basically
> that a client's changes are not guaranteed to be on the server until a
> close(2) and that a client is only guaranteed to check for updates on
> open(2).
>
> I can therefore see why one client reading while another is appending
> could result in stale data being read from the cache (and indeed, this
> is the reason the test program I provided continually opens and closes
> the file, seeking as necessary). But I'm having a hard time seeing
> why close-to-open cache consistency allows for returning data from the
> cache that never existed in the file.
So imagine 2 WRITE calls that are being sent to an initially empty
file. One WRITE call is for offset 0, and length 4096 bytes. The
second call is for offset 4096 and length 4096 bytes.
Imagine now that the first WRITE gets delayed (either because the page
cache isn't flushing that part of the file yet or because it gets
re-ordered in the RPC layer or on the server), and the second WRITE is
received and processed by the server first.
Once the delayed WRITE is processed there will be data at offset 0,
but until that happens, anyone reading the file on the server will see
a hole of length 4096 bytes.
This kind of issue is why close-to-open cache consistency relies on
only one client accessing the file on the server when it is open for
writing.
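Incidentally, the transient hole is easy to see even on a local filesystem; a quick sketch (Python 3 assumed for os.pwrite/os.pread, paths and sizes illustrative):
# Sketch of the out-of-order WRITE scenario above, played out locally.
import os
import tempfile
fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b'B' * 4096, 4096)   # the "second" WRITE lands first
    print(os.fstat(fd).st_size)        # 8192: the size is already extended
    print(os.pread(fd, 8, 0))          # b'\x00' * 8: the hole readers see
    os.pwrite(fd, b'A' * 4096, 0)      # the delayed "first" WRITE lands
    print(os.pread(fd, 8, 0))          # b'AAAAAAAA': the hole is gone
finally:
    os.close(fd)
    os.unlink(path)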
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
> So imagine 2 WRITE calls that are being sent to an initially empty
> file. One WRITE call is for offset 0, and length 4096 bytes. The
> second call is for offset 4096 and length 4096 bytes.
> Imagine now that the first WRITE gets delayed (either because the page
> cache isn't flushing that part of the file yet or because it gets
> re-ordered in the RPC layer or on the server), and the second WRITE is
> received and processed by the server first.
> Once the delayed WRITE is processed there will be data at offset 0,
> but until that happens, anyone reading the file on the server will see
> a hole of length 4096 bytes.
>
> This kind of issue is why close-to-open cache consistency relies on
> only one client accessing the file on the server when it is open for
> writing.
Fair enough. I am taking note of the fact that you said "This kind of
issue" implying there are probably other subtle cases I'm not thinking
about or that your example does not illustrate.
That said, in your example, there exists some moment in time when the
file on the server actually does have a hole in it full of 0's. In my
case, the file never contains 0's.
To be fair, when testing with an Isilon, I can't actually inspect the
state of the file on the server in any meaningful way, so I can't be
certain that's true. But, from the viewpoint of the reading client
at the NFS layer there are never 0's read back across the wire. I've
confirmed this by matching up wireshark traces while reproducing and
the READ replies never contain 0's. The 0's manifest due to reading
too far past where there is valid data in the page cache.
Does this still fall under the "not expected to work" category because
of the close-to-open issues you explained, or is there perhaps
something to fix here because it's entirely on the reading client side?
On Wed, Feb 25, 2015 at 4:02 PM, Chris Perl <[email protected]> wrote:
>> So imagine 2 WRITE calls that are being sent to an initially empty
>> file. One WRITE call is for offset 0, and length 4096 bytes. The
>> second call is for offset 4096 and length 4096 bytes.
>> Imagine now that the first WRITE gets delayed (either because the page
>> cache isn't flushing that part of the file yet or because it gets
>> re-ordered in the RPC layer or on the server), and the second WRITE is
>> received and processed by the server first.
>> Once the delayed WRITE is processed there will be data at offset 0,
>> but until that happens, anyone reading the file on the server will see
>> a hole of length 4096 bytes.
>>
>> This kind of issue is why close-to-open cache consistency relies on
>> only one client accessing the file on the server when it is open for
>> writing.
>
> Fair enough. I am taking note of the fact that you said "This kind of
> issue" implying there are probably other subtle cases I'm not thinking
> about or that your example does not illustrate.
>
> That said, in your example, there exists some moment in time when the
> file on the server actually does have a hole in it full of 0's. In my
> case, the file never contains 0's.
>
> To be fair, when testing with an Isilon, I can't actually inspect the
> state of the file on the server in any meaningful way, so I can't be
> certain that's true. But, from the viewpoint of the reading client
> at the NFS layer there are never 0's read back across the wire. I've
> confirmed this by matching up wireshark traces while reproducing and
> the READ replies never contain 0's. The 0's manifest due to reading
> too far past where there is valid data in the page cache.
Then that could be a GETATTR or something similar extending the file
size outside the READ rpc call. Since the pagecache data is copied to
userspace without any locks being held, we cannot prevent that race.
We're not going to push to fix that particular issue, since it is only
one of many possible races (see my previous example) when you try to
violate close-to-open cache consistency.
> Does this still fall under the "not expected to work" category because
> of the close-to-open issues you explained, or is there perhaps
> something to fix here because it's entirely on the reading client side?
There is nothing to fix. The close-to-open cache consistency model is
clear that your applications must not access a file that is being
modified on another client or on the server.
If you choose to operate outside that caching model, then your
application needs to provide its own consistency. Usually, that means
either using POSIX file locking, or using O_DIRECT/uncached I/O in
combination with some other synchronisation method.
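For instance, the reader side of the earlier reproducer could wrap each read in a POSIX lock; taking the lock also causes the Linux client to revalidate its cached data for the file. A rough sketch only (the writer would need to take LOCK_EX around each append for this to serialize anything):
import fcntl
import os
def locked_read(path, pos, length):
    fd = os.open(path, os.O_RDONLY)
    try:
        # Shared lock: blocks while a writer holds LOCK_EX, and revalidates
        # the client's cached data for the file.
        fcntl.lockf(fd, fcntl.LOCK_SH)
        return os.pread(fd, length, pos)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)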
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
> There is nothing to fix. The close-to-open cache consistency model is
> clear that your applications must not access a file that is being
> modified on another client or on the server.
Ok, thanks for helping me understand this a little more clearly. For
my own edification, is there somewhere I can find the details where
these things are spelled out (or is it just somewhere in rfc1813 that
I haven't seen)?
On Wed, Feb 25, 2015 at 4:53 PM, Chris Perl <[email protected]> wrote:
>> There is nothing to fix. The close-to-open cache consistency model is
>> clear that your applications must not access a file that is being
>> modified on another client or on the server.
>
> Ok, thanks for helping me understand this a little more clearly. For
> my own edification, is there somewhere I can find the details where
> these things are spelled out (or is it just somewhere in rfc1813 that
> I haven't seen)?
There is a short description here:
http://nfs.sourceforge.net/#faq_a8
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
>> Ok, thanks for helping me understand this a little more clearly. For
>> my own edification, is there somewhere I can find the details where
>> these things are spelled out (or is it just somewhere in rfc1813 that
>> I haven't seen)?
>
> There is a short description here:
> http://nfs.sourceforge.net/#faq_a8
Yes, thanks. I had come across that when trying to do a little
research after your initial reply.
However, I was hoping for something with a little more detail. For
example, you said earlier that "the close-to-open cache consistency
model is clear ...", which implied to me that there was a more formal
description somewhere outlining the semantics and constraints. Or is
it just more of an implementation detail?
Also, reading that FAQ entry seems to reinforce my original notion
that a client reading a file that is being updated might get stale
data returned from its cache, but shouldn't get corrupt data returned
from its cache. Perhaps the FAQ entry should be updated to explicitly
note that corrupt data can be returned?
FWIW, I realize that the use case I've given as a reproducer isn't
supported and isn't supposed to work. I accept that and that is fine.
However, we do run into this problem in "everyday types of file
sharing" (to quote that FAQ). Sometimes (not very often, but enough
that it's annoying), someone will cat a file that is already in their
client's page cache, and it happens to be at just the wrong time,
resulting in corrupt data being read.
On Thu, Feb 26, 2015 at 7:41 AM, Chris Perl <[email protected]> wrote:
>>> Ok, thanks for helping me understand this a little more clearly. For
>>> my own edification, is there somewhere I can find the details where
>>> these things are spelled out (or is it just somewhere in rfc1813 that
>>> I haven't seen)?
>>
>> There is a short description here:
>> http://nfs.sourceforge.net/#faq_a8
>
> Yes, thanks. I had come across that when trying to do a little
> research after your initial reply.
>
> However, I was hoping for something with a little more detail. For
> example, you said earlier that "the close-to-open cache consistency
> model is clear ...", which implied to me that there was a more formal
> description somewhere outlining the semantics and constraints. Or is
> it just more of an implementation detail?
>
> Also, reading that FAQ entry seems to reinforce my original notion
> that a client reading a file that is being updated might get stale
> data returned from its cache, but shouldn't get corrupt data returned
> from its cache. Perhaps the FAQ entry should be updated to explicitly
> note that corrupt data can be returned?
>
> FWIW, I realize that the use case I've given as a reproducer isn't
> supported and isn't supposed to work. I accept that and that is fine.
> However, we do run into this problem in "everyday types of file
> sharing" (to quote that FAQ). Sometimes (not very often, but enough
> that it's annoying), someone will cat a file that is already in their
> client's page cache, and it happens to be at just the wrong time,
> resulting in corrupt data being read.
If you are saying that we're not making it clear enough that "you
ignore these rules at your peril" then, fair enough, I'm sure Chuck
would be able to add a line to the faq stating just that.
However if you are asking us for an extensive list of "this is what I
can expect if I ignore these rules", then I don't think you will find
much traction. Such a list would be committing us to defining a model
for "non-close-to-open" semantics, which isn't of interest to me at
least, and I doubt anyone else is interested in committing to
maintaining that.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
> If you are saying that we're not making it clear enough that "you
> ignore these rules at your peril" then, fair enough, I'm sure Chuck
> would be able to add a line to the faq stating just that.
Yes, that's basically what I'm saying. I doubt many end users of NFS
realize that, given default mount options on modern Linux
distributions, simply catting a file that may have just been updated
on another system can result in reading back extra bytes that don't
exist in the file.
I guess my main point is the notion of reading "stale" data vs reading
"corrupt" data from the cache.
> However if you are asking us for an extensive list of "this is what I
> can expect if I ignore these rules", then I don't think you will find
> much traction. Such a list would be committing us to defining a model
> for "non-close-to-open" semantics, which isn't of interest to me at
> least, and I doubt anyone else is interested in committing to
> maintaining that.
I was just asking if such a thing existed. It sounds like it doesn't.
I don't expect you to produce one.
I, for one, simply was not aware that close-to-open cache consistency
required that if a file is open for writing on one client that it MUST
NOT be read by other clients at the same time (unless you potentially
want to read corrupt data).
Thanks for taking the time to discuss and clarify, it's much appreciated.
> However if you are asking us for an extensive list of "this is what I
> can expect if I ignore these rules", then I don't think you will find
> much traction. Such a list would be committing us to defining a model
> for "non-close-to-open" semantics, which isn't of interest to me at
> least, and I doubt anyone else is interested in committing to
> maintaining that.
One more point on this. I wasn't really asking for a list of what I
can expect if I ignore the rules (although I think pointing out that
reading corrupt data from the cache is worth mentioning), I was asking
what the rules for close-to-open consistency were so I can follow
them. I now know one of them is that if a file is open for writing on
one client then it can't be read on another. Are there others?
On Thu, 2015-02-26 at 09:10 -0500, Chris Perl wrote:
> > However if you are asking us for an extensive list of "this is what I
> > can expect if I ignore these rules", then I don't think you will find
> > much traction. Such a list would be committing us to defining a model
> > for "non-close-to-open" semantics, which isn't of interest to me at
> > least, and I doubt anyone else is interested in committing to
> > maintaining that.
>
> One more point on this. I wasn't really asking for a list of what I
> can expect if I ignore the rules (although I think pointing out that
> reading corrupt data from the cache is worth mentioning), I was asking
> what the rules for close-to-open consistency were so I can follow
> them. I now know one of them is that if a file is open for writing on
> one client then it can't be read on another. Are there others?
Is this a rule or a bug ?
How does an application know that the file is open elsewhere for
writing ?
Simo.
--
Simo Sorce * Red Hat, Inc * New York
On Thu, 2015-02-26 at 10:22 -0500, Simo Sorce wrote:
> On Thu, 2015-02-26 at 09:10 -0500, Chris Perl wrote:
> > > However if you are asking us for an extensive list of "this is what I
> > > can expect if I ignore these rules", then I don't think you will find
> > > much traction. Such a list would be committing us to defining a model
> > > for "non-close-to-open" semantics, which isn't of interest to me at
> > > least, and I doubt anyone else is interested in committing to
> > > maintaining that.
> >
> > One more point on this. I wasn't really asking for a list of what I
> > can expect if I ignore the rules (although I think pointing out that
> > reading corrupt data from the cache is worth mentioning), I was asking
> > what the rules for close-to-open consistency were so I can follow
> > them. I now know one of them is that if a file is open for writing on
> > one client then it can't be read on another. Are there others?
>
> Is this a rule or a bug ?
> How does an application know that the file is open elsewhere for
> writing ?
It is up to you to ensure that you don't set up such a situation, just
like it is also your responsibility to ensure that you don't run 2
applications that do read-modify-writes to the same file on a regular
POSIX filesystem.
This is a rule that has worked just fine for the NFS community for more
than 30 years. It isn't anything new that we're only adding to Linux.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Thu, 2015-02-26 at 10:34 -0500, Trond Myklebust wrote:
> On Thu, 2015-02-26 at 10:22 -0500, Simo Sorce wrote:
> > On Thu, 2015-02-26 at 09:10 -0500, Chris Perl wrote:
> > > > However if you are asking us for an extensive list of "this is what I
> > > > can expect if I ignore these rules", then I don't think you will find
> > > > much traction. Such a list would be committing us to defining a model
> > > > for "non-close-to-open" semantics, which isn't of interest to me at
> > > > least, and I doubt anyone else is interested in committing to
> > > > maintaining that.
> > >
> > > One more point on this. I wasn't really asking for a list of what I
> > > can expect if I ignore the rules (although I think pointing out that
> > > reading corrupt data from the cache is worth mentioning), I was asking
> > > what the rules for close-to-open consistency were so I can follow
> > > them. I now know one of them is that if a file is open for writing on
> > > one client then it can't be read on another. Are there others?
> >
> > Is this a rule or a bug ?
> > How does an application know that the file is open elsewhere for
> > writing ?
>
> It is up to you to ensure that you don't set up such a situation, just
> like it is also your responsibility to ensure that you don't run 2
> applications that do read-modify-writes to the same file on a regular
> POSIX filesystem.
>
> This is a rule that has worked just fine for the NFS community for more
> than 30 years. It isn't anything new that we're only adding to Linux.
Ok, I assume proper use of locking will address the situation and allow
readers to get consistent data and not null bytes ?
Simo.
--
Simo Sorce * Red Hat, Inc * New York
> Ok, I assume proper use of locking will address the situation and allow
> readers to get consistent data and not null bytes ?
Yes, Trond said this explicitly in one of his earlier replies.
On Thu, 2015-02-26 at 10:45 -0500, Chris Perl wrote:
> > Ok, I assume proper use of locking will address the situation and allow
> > readers to get consistent data and not null bytes ?
>
> Yes, Trond said this explicitly in one of his earlier replies.
Thank you, my confusion has been cleared.
Simo.
--
Simo Sorce * Red Hat, Inc * New York
Hi Chris,
On Thu, 2015-02-26 at 10:45 -0500, Chris Perl wrote:
> > Ok, I assume proper use of locking will address the situation and allow
> > readers to get consistent data and not null bytes ?
>
> Yes, Trond said this explicitly in one of his earlier replies.
You may want to double check by modifying your reproducer to use
locking.
Just want to clarify that your example is of a writer appending to a
file while a reader keeps trying to read up to the EOF. If so, what's
happening is that the metadata (file size) becomes out of sync with
actual file data requested from the NFS server due to post-op attr
attached to the READ reply containing an updated (increased) file size.
Then NULLs are introduced between the actual file data received and the
new increased file size. This is then presented to user space.
Before the synchronous READ path was removed from the NFS client code,
this issue could be avoided by using the "sync" NFS mount option.
cya,
#
> Just want to clarify that your example is of a writer appending to a
> file while a reader keeps trying to read up to the EOF.
Yes, the reader keeps issuing fixed size reads and eventually hits
EOF. It then stats the file over and over and when it sees more data
available to read, it does the same thing. If I modify the reproducer
to calculate the available bytes left in the file (as reported by the
stat) and then make sure to never hit EOF, it is no longer
reproducible.
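Concretely, the modified read loop looks something like this (a sketch, not the exact change to tail.py):
import os
def read_available(path, pos, chunk=20):
    # Clamp the read to the size reported by stat() so we never read past
    # the EOF the client currently knows about.
    avail = os.stat(path).st_size - pos
    if avail <= 0:
        return b''   # nothing new yet; the caller sleeps and retries
    with open(path, 'rb') as f:
        f.seek(pos)
        return f.read(min(chunk, avail))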
> If so, what's
> happening is that the metadata (file size) becomes out of sync with
> actual file data requested from the NFS server due to post-op attr
> attached to the READ reply containing an updated (increased) file size.
> Then NULLs are introduced between the actual file data received and the
> new increased file size. This is then presented to user space.
This is exactly what I believe is happening (as I kind of mentioned in
earlier messages, although I haven't confirmed that it's the post-op attrs
updating the size).
Are you saying that even with locking you think this might still be
possible? I had assumed that would synchronize the appender and the
reader enough that it seemed unlikely, but I can try it.
> It is up to you to ensure that you don't set up such a situation, just
> like it is also your responsibility to ensure that you don't run 2
> applications that do read-modify-writes to the same file on a regular
> POSIX filesystem.
>
> This is a rule that has worked just fine for the NFS community for more
> than 30 years. It isn't anything new that we're only adding to Linux.
I'm not trying to be annoying and hope I'm not coming across as such.
Given that I've used NFS for many years being blissfully unaware of
the close-to-open cache consistency requirement you've described (it
seems to work most of the time!), I'm generally curious if there are
other such rules I must follow or if the one we've been discussing is
the only such rule.
On Thu, Feb 26, 2015 at 11:00 AM, Chris Perl <[email protected]> wrote:
>> It is up to you to ensure that you don't set up such a situation, just
>> like it is also your responsibility to ensure that you don't run 2
>> applications that do read-modify-writes to the same file on a regular
>> POSIX filesystem.
>>
>> This is a rule that has worked just fine for the NFS community for more
>> than 30 years. It isn't anything new that we're only adding to Linux.
>
> I'm not trying to be annoying and hope I'm not coming across as such.
>
> Given that I've used NFS for many years being blissfully unaware of
> the close-to-open cache consistency requirement you've described (it
> seems to work most of the time!), I'm generally curious if there are
> other such rules I must follow or if the one we've been discussing is
> the only such rule.
The only other things that I can think of are:
1) Attributes are generally not immediately updated on clients. Even
if you specify 'noac', there is no locking that can ensure size
changes, for instance, are reflected well enough on the client to
ensure perfect O_APPEND semantics.
2) Readdir results are unreliable when the directory is being updated
on the server or on other clients. You should still be able to depend
on O_EXCL file create semantics, though (see the sketch below).
3) the lack of flock() type locks. We have to emulate them using byte
range locks, which means that under NFS the fcntl() and flock() locks
can and will conflict with one another.
4) file notifications generally do not work well, since there is no
protocol support for that either
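To illustrate point 2, the O_EXCL guarantee is what keeps the traditional lockfile pattern usable over NFS; a minimal sketch with a hypothetical helper:
import errno
import os
def try_lock(lockfile):
    # O_CREAT|O_EXCL is atomic across clients, so at most one caller wins.
    try:
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False   # another process/client holds the lock
        raise
    os.write(fd, ('%d\n' % os.getpid()).encode())
    os.close(fd)
    return True            # release later by unlinking the lockfile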
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Thu, Feb 26, 2015 at 08:29:51AM -0500, Trond Myklebust wrote:
> On Thu, Feb 26, 2015 at 7:41 AM, Chris Perl <[email protected]> wrote:
> >>> Ok, thanks for helping me understand this a little more clearly. For
> >>> my own edification, is there somewhere I can find the details where
> >>> these things are spelled out (or is it just somewhere in rfc1813 that
> >>> I haven't seen)?
> >>
> >> There is a short description here:
> >> http://nfs.sourceforge.net/#faq_a8
> >
> > Yes, thanks. I had come across that when trying to do a little
> > research after your initial reply.
> >
> > However, I was hoping for something with a little more detail. For
> > example, you said earlier that "the close-to-open cache consistency
> > model is clear ...", which implied to me that there was a more formal
> > description somewhere outlining the semantics and constraints. Or is
> > it just more of an implementation detail?
> >
> > Also, reading that FAQ entry seems to reinforce my original notion
> > that a client reading a file that is being updated might get stale
> > data returned from its cache, but shouldn't get corrupt data returned
> > from its cache. Perhaps the FAQ entry should be updated to explicitly
> > note that corrupt data can be returned?
> >
> > FWIW, I realize that the use case I've given as a reproducer isn't
> > supported and isn't supposed to work. I accept that and that is fine.
> > However, we do run into this problem in "everyday types of file
> > sharing" (to quote that FAQ). Sometimes (not very often, but enough
> > that it's annoying), someone will cat a file that is already in their
> > client's page cache, and it happens to be at just the wrong time,
> > resulting in corrupt data being read.
>
> If you are saying that we're not making it clear enough that "you
> ignore these rules at your peril" then, fair enough, I'm sure Chuck
> would be able to add a line to the faq stating just that.
Yeah, I don't think that FAQ answer is clear. It talks a little about
how close-to-open is implemented but doesn't really state clearly what
applications can assume. The first paragraph comes close, but it's
really just a motivating example.
A rough attempt, but it feels a little overboard while still
incomplete:
- access from multiple processes on the same client provides the
same guarantees as on local filesystems.
- access from multiple clients will provide the same guarantees
as long as no client's open for write overlaps any other open
from another client.
- if a client does open a file for read while another holds it
open for write, results of that client's reads are undefined.
- More specifically, a read of a byte at offset X could return a
byte that has been written to that offset during a concurrent
write open, or the value stored there at offset X at the start
of that write open, or 0 if X happened to be past the end of
file at any point during the concurrent write open. The
reader may not assume any relationships among values at
different offsets or the file size, updates to any of which
may be seen in any order.
?
--b.
On Feb 27, 2015, at 5:40 PM, [email protected] wrote:
> On Thu, Feb 26, 2015 at 08:29:51AM -0500, Trond Myklebust wrote:
>> On Thu, Feb 26, 2015 at 7:41 AM, Chris Perl <[email protected]> wrote:
>>>>> Ok, thanks for helping me understand this a little more clearly. For
>>>>> my own edification, is there somewhere I can find the details where
>>>>> these things are spelled out (or is it just somewhere in rfc1813 that
>>>>> I haven't seen)?
>>>>
>>>> There is a short description here:
>>>> http://nfs.sourceforge.net/#faq_a8
>>>
>>> Yes, thanks. I had come across that when trying to do a little
>>> research after your initial reply.
>>>
>>> However, I was hoping for something with a little more detail. For
>>> example, you said earlier that "the close-to-open cache consistency
>>> model is clear ...", which implied to me that there was a more formal
>>> description somewhere outlining the semantics and constraints. Or is
>>> it just more of an implementation detail?
>>>
>>> Also, reading that FAQ entry seems to reinforce my original notion
>>> that a client reading a file that is being updated might get stale
>>> data returned from its cache, but shouldn't get corrupt data returned
>>> from its cache. Perhaps the FAQ entry should be updated to explicitly
>>> note that corrupt data can be returned?
>>>
>>> FWIW, I realize that the use case I've given as a reproducer isn't
>>> supported and isn't supposed to work. I accept that and that is fine.
>>> However, we do run into this problem in "everyday types of file
>>> sharing" (to quote that FAQ). Sometimes (not very often, but enough
>>> that it's annoying), someone will cat a file that is already in their
>>> client's page cache, and it happens to be at just the wrong time,
>>> resulting in corrupt data being read.
>>
>> If you are saying that we're not making it clear enough that "you
>> ignore these rules at your peril" then, fair enough, I'm sure Chuck
>> would be able to add a line to the faq stating just that.
>
> Yeah, I don't think that FAQ answer is clear. It talks a little about
> how close-to-open is implemented but doesn't really state clearly what
> applications can assume. The first paragraph comes close, but it's
> really just a motivating example.
>
> A rough attempt, but it feels a little overboard while still
> incomplete:
I’m in favor of staying more hand-wavy. Otherwise you will end up
making promises you don’t intend to keep ;-)
Something like:
> Because NFS is not a cluster or “single system image” filesystem,
> applications must provide proper serialization of reads and writes among multiple clients to ensure correct application behavior and
> prevent corruption of file data. The close-to-open mechanism is not
> adequate in the presence of concurrent opens for write when multiple
> clients are involved.
Plus or minus some word-smithing.
And, let?s consider updating the DATA AND METADATA COHERENCY section
of nfs(5), which contains a similar discussion of close-to-open
cache consistency.
> - access from multiple processes on the same client provides the
> same guarantees as on local filesystems.
>
> - access from multiple clients will provide the same guarantees
> as long as no client's open for write overlaps any other open
> from another client.
>
> - if a client does open a file for read while another holds it
> open for write, results of that client's reads are undefined.
>
> - More specifically, a read of a byte at offset X could return a
> byte that has been written to that offset during a concurrent
> write open, or the value stored there at offset X at the start
> of that write open, or 0 if X happened to be past the end of
> file at any point during the concurrent write open. The
> reader may not assume any relationships among values at
> different offsets or the file size, updates to any of which
> may be seen in any order.
>
> ?
>
> --b.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
> I *think* in the situation where the file is being appended to, you're not
> likely to see anything but nulls.
Agreed.
> I'm not an expert but I think this occurs
> when the read happens between the file size being extended and the new data
> being written, effectively making a 'hole' at the end of the file.
As said in other replies, it's because the size is updated, causing too
much data to be read out of the page cache, effectively appending
NULLs to the data.
> When you're reading a file that's been concurrently modified, you can get
> corruption (a mix of new and old data) on pretty much any filesystem that
> doesn't enforce mandatory locks. It's just much less likely on a local
> filesystem because the window is shorter and more operations are atomic.
If you're talking about a local filesystem and you have one process
appending data (with O_APPEND) and another reading, I don't believe
you'll see the same sort of issue (but, I could be wrong).
On Feb 25, 2015, at 4:47 PM, Trond Myklebust <[email protected]> wrote:
> On Wed, Feb 25, 2015 at 4:02 PM, Chris Perl <[email protected]> wrote:
>>> So imagine 2 WRITE calls that are being sent to an initially empty
>>> file. One WRITE call is for offset 0, and length 4096 bytes. The
>>> second call is for offset 4096 and length 4096 bytes.
>>> Imagine now that the first WRITE gets delayed (either because the page
>>> cache isn't flushing that part of the file yet or because it gets
>>> re-ordered in the RPC layer or on the server), and the second WRITE is
>>> received and processed by the server first.
>>> Once the delayed WRITE is processed there will be data at offset 0,
>>> but until that happens, anyone reading the file on the server will see
>>> a hole of length 4096 bytes.
>>>
>>> This kind of issue is why close-to-open cache consistency relies on
>>> only one client accessing the file on the server when it is open for
>>> writing.
>>
>> Fair enough. I am taking note of the fact that you said "This kind of
>> issue" implying there are probably other subtle cases I'm not thinking
>> about or that your example does not illustrate.
>>
>> That said, in your example, there exists some moment in time when the
>> file on the server actually does have a hole in it full of 0's. In my
>> case, the file never contains 0's.
>>
>> To be fair, when testing with an Isilon, I can't actually inspect the
>> state of the file on the server in any meaningful way, so I can't be
>> certain that's true. But, from the viewpoint of the reading client
>> at the NFS layer there are never 0's read back across the wire. I've
>> confirmed this by matching up wireshark traces while reproducing and
>> the READ replies never contain 0's. The 0's manifest due to reading
>> too far past where there is valid data in the page cache.
>
> Then that could be a GETATTR or something similar extending the file
> size outside the READ rpc call. Since the pagecache data is copied to
> userspace without any locks being held, we cannot prevent that race.
FWIW it’s easy to reproduce a similar race with fsx, and I encounter
it frequently while running xfstests on fast NFS servers.
fsx invokes ftruncate following a set of asynchronous reads
(generated possibly due to readahead). The reads are started first,
then the SETATTR, but they complete out of order.
The SETATTR changes the test file’s size, and the completion
updates the file size in the client’s inode. Then the read requests
complete on the client and set the file’s size back to its old value.
All it takes is one late read completion, and the cached file size
is corrupted. fsx detects the file size mismatch and terminates the
test. The file size is corrected by a subsequent GETATTR (say, an
“ls -l” to check it after fsx has terminated).
While SETATTR blocks concurrent writes, there’s no serialization
on either the client or server to help guarantee the ordering of
SETATTR with read operations.
I’ve found a successful workaround by forcing the client to ignore
post-op attrs in read replies. A stronger solution might simply set
the “file attributes need update” flag in the inode if any file
attribute mutation is noticed during a read completion.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Wed, 2015-02-25 at 17:32 -0500, Chuck Lever wrote:
> FWIW it’s easy to reproduce a similar race with fsx, and I encounter
> it frequently while running xfstests on fast NFS servers.
>
> fsx invokes ftruncate following a set of asynchronous reads
> (generated possibly due to readahead). The reads are started first,
> then the SETATTR, but they complete out of order.
>
> The SETATTR changes the test file’s size, and the completion
> updates the file size in the client’s inode. Then the read requests
> complete on the client and set the file’s size back to its old value.
>
> All it takes is one late read completion, and the cached file size
> is corrupted. fsx detects the file size mismatch and terminates the
> test. The file size is corrected by a subsequent GETATTR (say, an
> “ls -l” to check it after fsx has terminated).
>
> While SETATTR blocks concurrent writes, there’s no serialization
> on either the client or server to help guarantee the ordering of
> SETATTR with read operations.
>
> I’ve found a successful workaround by forcing the client to ignore
> post-op attrs in read replies. A stronger solution might simply set
> the “file attributes need update” flag in the inode if any file
> attribute mutation is noticed during a read completion.
That's different. We definitely should aim to fix this kind of issue
since you are talking about a single client being the only thing
accessing the file on the server.
How about the following patch?
8<---------------------------------------------------------------
From d4c24528e8ac9c38d9ef98c5bbc15829f4032c0d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Wed, 25 Feb 2015 19:26:28 -0500
Subject: [PATCH] NFS: Quiesce reads before updating the file size after
truncating
Chuck Lever reports seeing readaheads racing with truncate operations
and causing the file size to be reverted. Fix is to quiesce reads
after truncating the file on the server, but before updating the
size on the client.
Reported-by: Chuck Lever <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/inode.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 83107be3dd01..c0aa87fd4766 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -565,6 +565,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
if (err)
goto out;
+ /* Quiesce reads before changing the file size */
+ invalidate_inode_pages2_range(&inode->i_mapping,
+ offset >> PAGE_CACHE_SHIFT;, -1);
+
spin_lock(&inode->i_lock);
i_size_write(inode, offset);
/* Optimisation */
--
2.1.0
On Wed, 2015-02-25 at 19:37 -0500, Trond Myklebust wrote:
> On Wed, 2015-02-25 at 17:32 -0500, Chuck Lever wrote:
> > FWIW it’s easy to reproduce a similar race with fsx, and I encounter
> > it frequently while running xfstests on fast NFS servers.
> >
> > fsx invokes ftruncate following a set of asynchronous reads
> > (generated possibly due to readahead). The reads are started first,
> > then the SETATTR, but they complete out of order.
> >
> > The SETATTR changes the test file’s size, and the completion
> > updates the file size in the client’s inode. Then the read requests
> > complete on the client and set the file’s size back to its old value.
> >
> > All it takes is one late read completion, and the cached file size
> > is corrupted. fsx detects the file size mismatch and terminates the
> > test. The file size is corrected by a subsequent GETATTR (say, an
> > “ls -l” to check it after fsx has terminated).
> >
> > While SETATTR blocks concurrent writes, there’s no serialization
> > on either the client or server to help guarantee the ordering of
> > SETATTR with read operations.
> >
> > I’ve found a successful workaround by forcing the client to ignore
> > post-op attrs in read replies. A stronger solution might simply set
> > the “file attributes need update” flag in the inode if any file
> > attribute mutation is noticed during a read completion.
>
> That's different. We definitely should aim to fix this kind of issue
> since you are talking about a single client being the only thing
> accessing the file on the server.
> How about the following patch?
Let me retry without the typos. The following will actually
compile... :-/
8<-------------------------------------------------------------
From d8e01c1a6cf7cf29b5b03c98edf1a74007ae3119 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Wed, 25 Feb 2015 19:26:28 -0500
Subject: [PATCH] NFS: Quiesce reads before updating the file size after
truncating
Chuck Lever reports seeing readaheads racing with truncate operations
and causing the file size to be reverted. Fix is to quiesce reads
after truncating the file on the server, but before updating the
size on the client.
Reported-by: Chuck Lever <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/inode.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 83107be3dd01..d0d74a72eb7d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -565,6 +565,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
if (err)
goto out;
+ /* Quiesce reads before changing the file size */
+ invalidate_inode_pages2_range(inode->i_mapping,
+ offset >> PAGE_CACHE_SHIFT, -1);
+
spin_lock(&inode->i_lock);
i_size_write(inode, offset);
/* Optimisation */
--
2.1.0
On Wed, 2015-02-25 at 19:43 -0500, Trond Myklebust wrote:
> On Wed, 2015-02-25 at 19:37 -0500, Trond Myklebust wrote:
> > On Wed, 2015-02-25 at 17:32 -0500, Chuck Lever wrote:
> > > FWIW it’s easy to reproduce a similar race with fsx, and I encounter
> > > it frequently while running xfstests on fast NFS servers.
> > >
> > > fsx invokes ftruncate following a set of asynchronous reads
> > > (generated possibly due to readahead). The reads are started first,
> > > then the SETATTR, but they complete out of order.
> > >
> > > The SETATTR changes the test file’s size, and the completion
> > > updates the file size in the client’s inode. Then the read requests
> > > complete on the client and set the file’s size back to its old value.
> > >
> > > All it takes is one late read completion, and the cached file size
> > > is corrupted. fsx detects the file size mismatch and terminates the
> > > test. The file size is corrected by a subsequent GETATTR (say, an
> > > “ls -l” to check it after fsx has terminated).
> > >
> > > While SETATTR blocks concurrent writes, there’s no serialization
> > > on either the client or server to help guarantee the ordering of
> > > SETATTR with read operations.
> > >
> > > I’ve found a successful workaround by forcing the client to ignore
> > > post-op attrs in read replies. A stronger solution might simply set
> > > the “file attributes need update” flag in the inode if any file
> > > attribute mutation is noticed during a read completion.
> >
> > That's different. We definitely should aim to fix this kind of issue
> > since you are talking about a single client being the only thing
> > accessing the file on the server.
> > How about the following patch?
>
> Let me retry without the typos. The following will actually
> compile... :-/
Third version: this contains a minor optimisation so that we don't force
the re-read of the last page in the file. Also a little more detail in
the changelog.
8<--------------------------------------------------------------------
From d2c822d64a68542665e4887b4d7bccd6e8e3a741 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Wed, 25 Feb 2015 19:26:28 -0500
Subject: [PATCH] NFS: Quiesce reads before updating the file size after
truncating
Chuck Lever reports seeing readaheads racing with truncate operations
and causing the file size to be reverted. Fix is to quiesce reads
after truncating the file on the server, but before updating the
size on the client.
Note:
1) We believe that waiting for those reads that lie outside the new
file boundaries is sufficient, since this means any attempt to read
the data in those areas will trigger a new RPC call, which will be
ordered w.r.t. the size change on the server.
2) We do not move the call to truncate_pagecache(), since we want
to ensure that any partial page zeroing is ordered w.r.t. the
update of inode->i_size.
Reported-by: Chuck Lever <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/inode.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 83107be3dd01..daf3b1e183fe 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -565,6 +565,11 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
if (err)
goto out;
+ /* Quiesce reads that lie outside the new file size */
+ invalidate_inode_pages2_range(inode->i_mapping,
+ (offset + PAGE_CACHE_SHIFT - 1) >> PAGE_CACHE_SHIFT,
+ -1);
+
spin_lock(&inode->i_lock);
i_size_write(inode, offset);
/* Optimisation */
--
2.1.0
On Feb 25, 2015, at 8:27 PM, Trond Myklebust <[email protected]> wrote:
> On Wed, 2015-02-25 at 19:43 -0500, Trond Myklebust wrote:
>> On Wed, 2015-02-25 at 19:37 -0500, Trond Myklebust wrote:
>>> On Wed, 2015-02-25 at 17:32 -0500, Chuck Lever wrote:
>>>> FWIW it’s easy to reproduce a similar race with fsx, and I encounter
>>>> it frequently while running xfstests on fast NFS servers.
>>>>
>>>> fsx invokes ftruncate following a set of asynchronous reads
>>>> (generated possibly due to readahead). The reads are started first,
>>>> then the SETATTR, but they complete out of order.
>>>>
>>>> The SETATTR changes the test file’s size, and the completion
>>>> updates the file size in the client’s inode. Then the read requests
>>>> complete on the client and set the file’s size back to its old value.
>>>>
>>>> All it takes is one late read completion, and the cached file size
>>>> is corrupted. fsx detects the file size mismatch and terminates the
>>>> test. The file size is corrected by a subsequent GETATTR (say, an
>>>> “ls -l” to check it after fsx has terminated).
>>>>
>>>> While SETATTR blocks concurrent writes, there’s no serialization
>>>> on either the client or server to help guarantee the ordering of
>>>> SETATTR with read operations.
>>>>
>>>> I’ve found a successful workaround by forcing the client to ignore
>>>> post-op attrs in read replies. A stronger solution might simply set
>>>> the “file attributes need update” flag in the inode if any file
>>>> attribute mutation is noticed during a read completion.
>>>
>>> That's different. We definitely should aim to fix this kind of issue
>>> since you are talking about a single client being the only thing
>>> accessing the file on the server.
>>> How about the following patch?
>>
>> Let me retry without the typos. The following will actually
>> compile... :-/
>
> Third version: this contains a minor optimisation so that we don't force
> the re-read of the last page in the file. Also a little more detail in
> the changelog.
Thanks, I’ll give this a shot, hopefully today.
> 8<--------------------------------------------------------------------
> From d2c822d64a68542665e4887b4d7bccd6e8e3a741 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Wed, 25 Feb 2015 19:26:28 -0500
> Subject: [PATCH] NFS: Quiesce reads before updating the file size after
> truncating
>
> Chuck Lever reports seeing readaheads racing with truncate operations
> and causing the file size to be reverted. Fix is to quiesce reads
> after truncating the file on the server, but before updating the
> size on the client.
>
> Note:
>
> 1) We believe that waiting for those reads that lie outside the new
> file boundaries is sufficient, since this means any attempt to read
> the data in those areas will trigger a new RPC call, which will be
> ordered w.r.t. the size change on the server.
>
> 2) We do not move the call to truncate_pagecache(), since we want
> to ensure that any partial page zeroing is ordered w.r.t. the
> update of inode->i_size.
>
> Reported-by: Chuck Lever <[email protected]>
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> fs/nfs/inode.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 83107be3dd01..daf3b1e183fe 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -565,6 +565,11 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
> if (err)
> goto out;
>
> + /* Quiesce reads that lie outside the new file size */
> + invalidate_inode_pages2_range(inode->i_mapping,
> + (offset + PAGE_CACHE_SHIFT - 1) >> PAGE_CACHE_SHIFT,
> + -1);
> +
> spin_lock(&inode->i_lock);
> i_size_write(inode, offset);
> /* Optimisation */
> --
> 2.1.0
>
>
>
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Hi Chris!
Still in the context of deciding what should go in the FAQ, my
comments are below.
On Mar 2, 2015, at 10:19 AM, Chris Perl <[email protected]> wrote:
>> I’m in favor of staying more hand-wavy. Otherwise you will end up
>> making promises you don’t intend to keep ;-)
>
> FWIW, I'm in favor of at least some specifics. Something stating that
> the results of reading from a file while another client holds it open
> for write are undefined (point 3 of what was already proposed).
The language has to be very careful.
Opens for write are used frequently and by themselves do not cause
any damage. Corruption risk increases when actual writes occur,
followed by reads of the same file on other clients, without a
close and re-open.
Also, any published statement about this could lock us into a
particular behavior. That makes it harder to improve or change
(say, to address a bug) in the future.
Specifics can be discussed on the mailing list on a case-by-case
basis. The specifics depend on a bunch of things, for example:
- which clients are in play
- which NFS version is in use (delegations and open state)
- whether reading and writing is in the same byte range
- whether the file size is changing
- whether O_DIRECT and mapped files are in use
And so on. There could also be real bugs, which would become
apparent after discussion.
A general (and more explicit) warning about write sharing is
appropriate to add. I feel it would be difficult to get the
specifics right.
I could add something to A8 that says “Detailed questions can be
directed to [email protected].” Do you feel you got a good
answer from the mailing list?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
> I’m in favor of staying more hand-wavy. Otherwise you will end up
> making promises you don’t intend to keep ;-)
FWIW, I'm in favor of at least some specifics. Something stating that
the results of reading from a file while another client holds it open
for write are undefined (point 3 of what was already proposed).
On Mon, Mar 02, 2015 at 10:57:43AM -0500, Chuck Lever wrote:
> Hi Chris!
>
> Still in the context of deciding what should go in the FAQ, my
> comments are below.
>
>
> On Mar 2, 2015, at 10:19 AM, Chris Perl <[email protected]> wrote:
>
> >> I’m in favor of staying more hand-wavy. Otherwise you will end up
> >> making promises you don’t intend to keep ;-)
> >
> > FWIW, I'm in favor of at least some specifics. Something stating that
> > the results of reading from a file while another client holds it open
> > for write are undefined (point 3 of what was already proposed).
>
> The language has to be very careful.
>
> Opens for write are used frequently and by themselves do not cause
> any damage. Corruption risk increases when actual writes occur,
> followed by reads of the same file on other clients, without a
> close and re-open.
>
> Also, any published statement about this could lock us into a
> particular behavior. That makes it harder to improve or change
> (say, to address a bug) in the future.
>
> Specifics can be discussed on the mailing list on a case-by-case
> basis. The specifics depend on a bunch of things, for example:
>
> - which clients are in play
> - which NFS version is in use (delegations and open state)
> - whether reading and writing is in the same byte range
> - whether the file size is changing
> - whether O_DIRECT and mapped files are in use
These are cases we'd only need to go into if we wanted to give
*stronger* guarantees than "all bets are off when you don't serialize
with write opens", yes?
And, sure, that would be interesting, but I think Chris is just asking
for a clearer statement of that basic requirement.
He's far from the first to assume that the results you'd get from
overlapping opens would be out of date but still reasonable in some
sense, and the FAQ could use a stronger statement that that's not the
case.
--b.
>
> And so on. There could also be real bugs, which would become
> apparent after discussion.
>
> A general (and more explicit) warning about write sharing is
> appropriate to add. I feel it would be difficult to get the
> specifics right.
>
> I could add something to A8 that says “Detailed questions can be
> directed to [email protected].” Do you feel you got a good
> answer from the mailing list?
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
Hi
On Mon, Mar 2, 2015 at 9:58 PM, J. Bruce Fields <[email protected]> wrote:
> On Mon, Mar 02, 2015 at 10:57:43AM -0500, Chuck Lever wrote:
>> Hi Chris!
>>
>> Still in the context of deciding what should go in the FAQ, my
>> comments are below.
>>
>>
>> On Mar 2, 2015, at 10:19 AM, Chris Perl <[email protected]> wrote:
>>
>> >> I’m in favor of staying more hand-wavy. Otherwise you will end up
>> >> making promises you don’t intend to keep ;-)
>> >
>> > FWIW, I'm in favor of at least some specifics. Something stating that
>> > the results of reading from a file while another client holds it open
>> > for write are undefined (point 3 of what was already proposed).
>>
>> The language has to be very careful.
>>
>> Opens for write are used frequently and by themselves do not cause
>> any damage. Corruption risk increases when actual writes occur,
>> followed by reads of the same file on other clients, without a
>> close and re-open.
>>
>> Also, any published statement about this could lock us into a
>> particular behavior. That makes it harder to improve or change
>> (say, to address a bug) in the future.
>>
>> Specifics can be discussed on the mailing list on a case-by-case
>> basis. The specifics depend on a bunch of things, for example:
>>
>> - which clients are in play
>> - which NFS version is in use (delegations and open state)
>> - whether reading and writing is in the same byte range
>> - whether the file size is changing
>> - whether O_DIRECT and mapped files are in use
>
> These are cases we'd only need to go into if we wanted to give
> *stronger* guarantees than "all bets are off when you don't serialize
> with write opens", yes?
>
> And, sure, that would be interesting, but I think Chris is just asking
> for a clearer statement of that basic requirement.
>
> He's far from the first to assume that the results you'd get from
> overlapping opens would be out of date but still reasonable in some
> sense, and the FAQ could use a stronger statement that that's not the
> case.
>
> --b.
Yes, writing something like:
tail -f foo
doesn't work if readers and writer aren't on the same client
IMO it's the most surprising for someone with a POSIX background on a local fs.
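As an illustration of the kind of naive polling tailer being discussed, here is a hypothetical sketch (nothing below is from the thread; the path and poll interval are made up). On a local filesystem it behaves as expected; over NFS, with the writer on a different client or on the server, nothing in it forces the reader's cached pages to be revalidated between polls:

    import os
    import time

    def naive_tail(path, interval=0.5):
        """Print newly appended bytes of an append-only file as hex.

        Over NFS this can misbehave: since nothing here revalidates the
        reader's cache, the freshly appended range may come back stale
        or NUL-filled.
        """
        with open(path, 'rb') as f:
            offset = 0
            while True:
                size = os.fstat(f.fileno()).st_size
                if size > offset:
                    f.seek(offset)
                    data = f.read(size - offset)
                    print(data.hex())
                    offset += len(data)
                time.sleep(interval)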
On Mar 2, 2015, at 3:58 PM, J. Bruce Fields <[email protected]> wrote:
> On Mon, Mar 02, 2015 at 10:57:43AM -0500, Chuck Lever wrote:
>> Hi Chris!
>>
>> Still in the context of deciding what should go in the FAQ, my
>> comments are below.
>>
>>
>> On Mar 2, 2015, at 10:19 AM, Chris Perl <[email protected]> wrote:
>>
>>>> I’m in favor of staying more hand-wavy. Otherwise you will end up
>>>> making promises you don’t intend to keep ;-)
>>>
>>> FWIW, I'm in favor of at least some specifics. Something stating that
>>> the results of reading from a file while another client holds it open
>>> for write are undefined (point 3 of what was already proposed).
>>
>> The language has to be very careful.
>>
>> Opens for write are used frequently and by themselves do not cause
>> any damage. Corruption risk increases when actual writes occur,
>> followed by reads of the same file on other clients, without a
>> close and re-open.
>>
>> Also, any published statement about this could lock us into a
>> particular behavior. That makes it harder to improve or change
>> (say, to address a bug) in the future.
>>
>> Specifics can be discussed on the mailing list on a case-by-case
>> basis. The specifics depend on a bunch of things, for example:
>>
>> - which clients are in play
>> - which NFS version is in use (delegations and open state)
>> - whether reading and writing is in the same byte range
>> - whether the file size is changing
>> - whether O_DIRECT and mapped files are in use
>
> These are cases we'd only need to go into if we wanted to give
> *stronger* guarantees than "all bets are off when you don't serialize
> with write opens", yes?
> And, sure, that would be interesting, but I think Chris is just asking
> for a clearer statement of that basic requirement.
I read Chris’s e-mail as a request for more detail in the answer
I proposed before. Maybe I’m wrong.
Is this not sufficient:
“Because NFS is not a cluster or “single system image” filesystem,
applications must provide proper serialization of reads and writes
among multiple clients to ensure correct application behavior and
prevent corruption of file data. The close-to-open mechanism is not
adequate in the presence of concurrent opens for write when multiple
clients are involved.”
Perhaps the “presence of concurrent opens for write” should be
replaced with “presence of concurrent writes.” But what else needs
to be said?
> He's far from the first to assume that the results you'd get from
> overlapping opens would be out of date but still reasonable in some
> sense, and the FAQ could use a stronger statement that that's not the
> case.
No disagreement there. I’m just trying to figure out the right
level of detail.
> --b.
>
>>
>> And so on. There could also be real bugs, which would become
>> apparent after discussion.
>>
>> A general (and more explicit) warning about write sharing is
>> appropriate to add. I feel it would be difficult to get the
>> specifics right.
>>
>> I could add something to A8 that says “Detailed questions can be
>> directed to [email protected].” Do you feel you got a good
>> answer from the mailing list?
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On 03/02/2015 11:33 PM, didier wrote:
<>
> Yes, writing something like:
> tail -f foo
> doesn't work if readers and writer aren't on the same client
>
s/doesn't/might not/
There are some implementations, i.e. server/client combinations, where this
works perfectly fine.
> IMO it's the most surprising for someone with a POSIX background on a local fs.
Cheers
Boaz
Just to quickly answer the earlier question about whether I have
received a good answer from this list: Yes, I have, and I appreciate
all the time and discussion spent helping me understand what I've been
seeing.
> I read Chris’s e-mail as a request for more detail in the answer
> I proposed before. Maybe I’m wrong.
>
> Is this not sufficient:
>
> “Because NFS is not a cluster or “single system image” filesystem,
> applications must provide proper serialization of reads and writes
> among multiple clients to ensure correct application behavior and
> prevent corruption of file data. The close-to-open mechanism is not
> adequate in the presence of concurrent opens for write when multiple
> clients are involved.”
>
> Perhaps the “presence of concurrent opens for write” should be
> replaced with “presence of concurrent writes.” But what else needs
> to be said?
I'm trying to decide if, having read the above in the FAQ, it would
have been obvious to me that what I was seeing was expected. I don't
know that it would have been. That paragraph seems to impart to me
the idea that if you don't synchronize your readers and writers you
might get weird results, much like a local file system. However, I
thought the way in which my updates were happening made this ok, as I
was only ever appending data to a given file.
I guess I was hoping for something more along the lines of (this would
be meant to be inserted after paragraph 6 of A8 in the FAQ):
Furthermore, in the presence of multiple clients, at least one of
which is writing to a given file, close-to-open cache consistency
makes no guarantees about the data that might be returned from the
cache on a client issuing reads concurrent with those writes. It may
return stale data or it may return incorrect data.
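For readers wondering what "proper serialization of reads and writes among multiple clients" can look like in practice, here is a hedged sketch using POSIX advisory byte-range locks (the paths, record format, and helper names are hypothetical; it assumes the server and all clients support NFS file locking). On the Linux client, acquiring a lock is generally also treated as a cache consistency point and releasing one flushes dirty data, although that behaviour is client-specific:

    import fcntl
    import os

    def append_record(path, data):
        """Writer: append one record under an exclusive whole-file lock."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until the lock is granted
            os.write(fd, data)
            os.fsync(fd)                     # make sure the server has the data
            fcntl.lockf(fd, fcntl.LOCK_UN)
        finally:
            os.close(fd)

    def read_snapshot(path):
        """Reader: read the whole file under a shared lock."""
        fd = os.open(path, os.O_RDONLY)
        try:
            # On Linux, lock acquisition also revalidates the cache (client-specific).
            fcntl.lockf(fd, fcntl.LOCK_SH)
            chunks = []
            while True:
                chunk = os.read(fd, 65536)
                if not chunk:
                    break
                chunks.append(chunk)
            fcntl.lockf(fd, fcntl.LOCK_UN)
            return b''.join(chunks)
        finally:
            os.close(fd)

The intent is that a writer calling append_record() on one client and a reader calling read_snapshot() on another never interleave, so the reader should not observe a half-written or NUL-padded append.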
On Mar 3, 2015, at 8:29 AM, Chris Perl <[email protected]> wrote:
> Just to quickly answer the earlier question about whether I have
> received a good answer from this list: Yes, I have, and I appreciate
> all the time and discussion spent helping me understand what I've been
> seeing.
>
>> I read Chris’s e-mail as a request for more detail in the answer
>> I proposed before. Maybe I’m wrong.
>>
>> Is this not sufficient:
>>
>> “Because NFS is not a cluster or “single system image” filesystem,
>> applications must provide proper serialization of reads and writes
>> among multiple clients to ensure correct application behavior and
>> prevent corruption of file data. The close-to-open mechanism is not
>> adequate in the presence of concurrent opens for write when multiple
>> clients are involved.”
>>
>> Perhaps the “presence of concurrent opens for write” should be
>> replaced with “presence of concurrent writes.” But what else needs
>> to be said?
>
> I'm trying to decide if, having read the above in the FAQ, it would
> have been obvious to me that what I was seeing was expected.
That’s exactly the right question to ask.
> I don't
> know that it would have been. That paragraph seems to impart to me
> the idea that if you don't synchronize your readers and writers you
> might get weird results, much like a local file system. However, I
> thought the way in which my updates were happening made this ok, as I
> was only ever appending data to a given file.
>
> I guess I was hoping for something more along the lines of (this would
> be meant to be inserted after paragraph 6 of A8 in the FAQ):
>
> Furthermore, in the presence of multiple clients, at least one of
> which is writing to a given file, close-to-open cache consistency
> makes no guarantees about the data that might be returned from the
> cache on a client issuing reads concurrent with those writes. It may
> return stale data or it may return incorrect data.
This works for me. Bruce, Trond, any thoughts?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Tue, Mar 3, 2015 at 10:30 AM, Chuck Lever <[email protected]> wrote:
>
> On Mar 3, 2015, at 8:29 AM, Chris Perl <[email protected]> wrote:
>
>> Just to quickly answer the earlier question about whether I have
>> received a good answer from this list: Yes, I have, and I appreciate
>> all the time and discussion spent helping me understand what I've been
>> seeing.
>>
>>> I read Chris’s e-mail as a request for more detail in the answer
>>> I proposed before. Maybe I’m wrong.
>>>
>>> Is this not sufficient:
>>>
>>> “Because NFS is not a cluster or “single system image” filesystem,
>>> applications must provide proper serialization of reads and writes
>>> among multiple clients to ensure correct application behavior and
>>> prevent corruption of file data. The close-to-open mechanism is not
>>> adequate in the presence of concurrent opens for write when multiple
>>> clients are involved.”
>>>
>>> Perhaps the “presence of concurrent opens for write” should be
>>> replaced with “presence of concurrent writes.” But what else needs
>>> to be said?
>>
>> I'm trying to decide if, having read the above in the FAQ, it would
>> have been obvious to me that what I was seeing was expected.
>
> That’s exactly the right question to ask.
>
>> I don't
>> know that it would have been. That paragraph seems to impart to me
>> the idea that if you don't synchronize your readers and writers you
>> might get weird results, much like a local file system. However, I
>> thought the way in which my updates were happening made this ok, as I
>> was only ever appending data to a given file.
>>
>> I guess I was hoping for something more along the lines of (this would
>> be meant to be inserted after paragraph 6 of A8 in the FAQ):
>>
>> Furthermore, in the presence of multiple clients, at least one of
>> which is writing to a given file, close-to-open cache consistency
>> makes no guarantees about the data that might be returned from the
>> cache on a client issuing reads concurrent with those writes. It may
>> return stale data or it may return incorrect data.
>
> This works for me. Bruce, Trond, any thoughts?
>
I'd say we can give stronger advice. As I said earlier, close-to-open
is a known model that is implemented by most NFS clients that want to
cache data, so I see no reason not to go into more detail. If ever we
change the Linux client behaviour in ways that differ, then we can no
longer claim to be doing close-to-open.
So here are the things we can document:
1) The client will revalidate its cached attributes and data on
open(). If the file or directory has changed on the server, the data
cache will be invalidated/flushed.
2) The client will revalidate its cached attributes on close().
3) Between open() and close(), the behaviour w.r.t. cache revalidation
is undefined. The client may on occasion check whether or not its
cached data is valid, but applications must not rely on such
behaviour. If the file is changed on the server during that time, then
the behaviour of read()/readdir() on the client is undefined. In
particular, note that read() may return stale data and/or holes.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Mar 3, 2015, at 12:44 PM, Trond Myklebust <[email protected]> wrote:
> On Tue, Mar 3, 2015 at 10:30 AM, Chuck Lever <[email protected]> wrote:
>>
>> On Mar 3, 2015, at 8:29 AM, Chris Perl <[email protected]> wrote:
>>
>>> Just to quickly answer the earlier question about whether I have
>>> received a good answer from this list: Yes, I have, and I appreciate
>>> all the time and discussion spent helping me understand what I've been
>>> seeing.
>>>
>>>> I read Chris’s e-mail as a request for more detail in the answer
>>>> I proposed before. Maybe I’m wrong.
>>>>
>>>> Is this not sufficient:
>>>>
>>>> “Because NFS is not a cluster or “single system image” filesystem,
>>>> applications must provide proper serialization of reads and writes
>>>> among multiple clients to ensure correct application behavior and
>>>> prevent corruption of file data. The close-to-open mechanism is not
>>>> adequate in the presence of concurrent opens for write when multiple
>>>> clients are involved.”
>>>>
>>>> Perhaps the “presence of concurrent opens for write” should be
>>>> replaced with “presence of concurrent writes.” But what else needs
>>>> to be said?
>>>
>>> I'm trying to decide if, having read the above in the FAQ, it would
>>> have been obvious to me that what I was seeing was expected.
>>
>> That’s exactly the right question to ask.
>>
>>> I don't
>>> know that it would have been. That paragraph seems to impart to me
>>> the idea that if you don't synchronize your readers and writers you
>>> might get weird results, much like a local file system. However, I
>>> thought the way in which my updates were happening made this ok, as I
>>> was only ever appending data to a given file.
>>>
>>> I guess I was hoping for something more along the lines of (this would
>>> be meant to be inserted after paragraph 6 of A8 in the FAQ):
>>>
>>> Furthermore, in the presence of multiple clients, at least one of
>>> which is writing to a given file, close-to-open cache consistency
>>> makes no guarantees about the data that might be returned from the
>>> cache on a client issuing reads concurrent with those writes. It may
>>> return stale data or it may return incorrect data.
>>
>> This works for me. Bruce, Trond, any thoughts?
>>
>
> I'd say we can give stronger advice. As I said earlier, close-to-open
> is a known model that is implemented by most NFS clients that want to
> cache data, so I see no reason not to go into more detail. If ever we
> change the Linux client behaviour in ways that differ, then we can no
> longer claim to be doing close-to-open.
Fair enough.
> So here are the things we can document:
>
> 1) The client will revalidate its cached attributes and data on
> open(). If the file or directory has changed on the server, the data
> cache will be invalidated/flushed.
> 2) The client will revalidate its cached attributes on close().
This is paragraph 3 of A8. It can be clarified.
> 3) Between open() and close(), the behaviour w.r.t. cache revalidation
> is undefined. The client may on occasion check whether or not its
> cached data is valid, but applications must not rely on such
> behaviour.
That?s a good addition.
I think a direct warning, a la what Chris posted, is also advisable,
in a way that is not buried in operational detail. The point is to
add a tl;dr that is clear even to those who skim this answer.
> If the file is changed on the server during that time, then
> the behaviour of read()/readdir() on the client is undefined. In
> particular, note that read() may return stale data and/or holes.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com