2005-10-26 15:28:06

by Lever, Charles

Subject: RE: Data coherency trouble with multiple clients, on 2.6.14-rc5

> > The client mount options are: noac,nocto,sync,hard,intr,rw,nfsvers=3
> > The server exports flags are: (rw,no_root_squash,sync,no_wdelay,insecure)
> >
> > Various combinations of the above mount options make no difference.
> > I've also tried with a 2.6.14-rc5 server, and the problem persists.
>
> It sounds like you are really looking for the O_DIRECT (i.e. uncached)
> file I/O mode.

i agree that O_DIRECT is the right solution. however, even with "noac"
i would expect mr. duffy's workload to behave correctly most of the
time. sounds like he is able to make it fail very easily.


-------------------------------------------------------
This SF.Net email is sponsored by the JBoss Inc.
Get Certified Today * Register for a JBoss Training Course
Free Certification Exam for All Training Attendees Through End of 2005
Visit http://www.jboss.com/services/certification for more information
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-10-26 16:51:26

by Trond Myklebust

Subject: RE: Data coherency trouble with multiple clients, on 2.6.14-rc5

On Wed, 26.10.2005 at 08:27 (-0700), Lever, Charles wrote:

> i agree that O_DIRECT is the right solution. however, even with "noac"
> i would expect mr. duffy's workload to behave correctly most of the
> time. sounds like he is able to make it fail very easily.

Why would you expect that?

"noac" does not turn off data caching, nor does it change the policy
that the client will not invalidate the data cache while it is holding
the file open for write.

Cheers,
Trond




2005-10-26 18:45:52

by Peter Staubach

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

Trond Myklebust wrote:

>On Wed, 26.10.2005 at 08:27 (-0700), Lever, Charles wrote:
>
>
>
>>i agree that O_DIRECT is the right solution. however, even with "noac"
>>i would expect mr. duffy's workload to behave correctly most of the
>>time. sounds like he is able to make it fail very easily.
>>
>>
>
>Why would you expect that?
>
>"noac" does not turn off data caching, nor does it change the policy
>that the client will not invalidate the data cache while it is holding
>the file open for write.
>

It seems to me that the policy should be to allow cache
validation/invalidation unless the file is mmap'd for writing or if
there are active WRITEs outstanding. Simply having the file open for
write should not affect the consistency model.

Thanx...

ps



2005-10-26 19:09:03

by Trond Myklebust

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

On Wed, 26.10.2005 at 14:45 (-0400), Peter Staubach wrote:

> It seems to me that the policy should be to allow cache
> validation/invalidation unless the file is mmap'd for writing or if
> there are active WRITEs outstanding. Simply having the file open for
> write should not affect the consistency model.

That is what we did for 2.4.x, but what kind of extra guarantees does
that really bring you? You cannot actually rely on it to provide
stronger caching semantics than the close-to-open case.

Cheers,
Trond




2005-10-26 19:53:32

by Peter Staubach

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

Trond Myklebust wrote:

>On Wed, 26.10.2005 at 14:45 (-0400), Peter Staubach wrote:
>
>
>
>>It seems to me that the policy should be to allow cache
>>validation/invalidation unless the file is mmap'd for writing or if
>>there are active WRITEs outstanding. Simply having the file open for
>>write should not affect the consistency model.
>>
>>
>
>That is what we did for 2.4.x, but what kind of extra guarantees does
>that really bring you? You cannot actually rely on it to provide
>stronger caching semantics than the close-to-open case.
>

This brings lots of extra guarantees, actually. Just because the file is
open for writing does not mean that there are any dirty pages hanging
around waiting to be written. And, even if there are, they will get
flushed when the conflict is detected. Last one there wins. This
is even the policy when local processes conflict on the same file in the
same region.

This policy would address the situation that was reported here.

This policy will definitely result in _much_ stronger caching semantics
than does close-to-open. These two policies together can usually result
in reasonable cache consistency, enough for most applications. Applications
which need stronger cache consistency should use advisory locking in order
to synchronize access to the file.
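
The advisory locking suggested above can be sketched with POSIX byte-range
locks. This is a minimal Python sketch, not a definitive implementation: the
helper names locked_write and locked_read are hypothetical, and it locks the
whole file for simplicity. The Linux nfs(5) manual page recommends file
locking when absolute cache coherence among clients is required; the sketch
assumes the NFS client treats lock acquisition and release as points where it
revalidates and flushes cached data, which should be checked against the
documentation for the kernel in use.

```python
import fcntl
import os


def locked_write(path, offset, data):
    """Write `data` at `offset` while holding an exclusive advisory lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until the whole-file lock is granted
        os.pwrite(fd, data, offset)
        os.fsync(fd)                     # push the data out before releasing the lock
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)


def locked_read(path, offset, length):
    """Read `length` bytes at `offset` while holding a shared advisory lock."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.lockf(fd, fcntl.LOCK_SH)   # readers may share; writers are excluded
        return os.pread(fd, length, offset)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Every reader and writer has to cooperate by taking the lock; a process that
skips it is not held back, since the locks are advisory.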

ps



2005-10-26 21:06:09

by Trond Myklebust

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

On Wed, 26.10.2005 at 15:53 (-0400), Peter Staubach wrote:
> This brings lots of extra guarantees, actually. Just because the file is
> open for writing does not mean that there are any dirty pages hanging
> around waiting to be written. And, even if there are, they will get
> flushed when the conflict is detected. Last one there wins. This
> is even the policy when local processes conflict on the same file in the
> same region.
>
> This policy would address the situation that was reported here.
>
> This policy will definitely result in _much_ stronger caching semantics
> than does close-to-open. These two policies together can usually result
> in reasonable cache consistency, enough for most applications. Applications
> which need stronger cache consistency should use advisory locking in order
> to synchronize access to the file.

Sure, but the big issue here is how to actually detect conflicts (and
avoid excessive false positives).

NFSv3 does in theory give you the option of detecting conflicts using
weak cache consistency. In practice, write reordering and the fact that
most servers violate the requirement given by RFC1813 that pre/post-op
attributes should be atomic w.r.t. the main operation prevents you from
closing the hole.
NFSv2 and NFSv4 don't even have support for WCC, so your detection
scheme ends up being very dependent on one particular version of NFS.

Basically, what I'm saying is that as long as we cannot implement the
above ideal, we should not be issuing promises to application developers
that they can rely on it. O_DIRECT was specifically developed in order
to give database implementers a reliable uncached I/O interface, and so
that is what we should direct them towards.
The worst thing to do when someone asks IMHO is to reply that "we can
almost but not quite fix noac".
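
For readers unfamiliar with the uncached I/O interface mentioned above, here
is a minimal, Linux-only Python sketch of an O_DIRECT read. The function name
read_uncached and the 4096-byte BLOCK alignment are assumptions made for
illustration; the real alignment requirement depends on the filesystem and
kernel, and some filesystems reject O_DIRECT with an error.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on the filesystem


def read_uncached(path, length=BLOCK, offset=0):
    """Read up to `length` bytes at `offset` with O_DIRECT (bypassing the page cache)."""
    if length % BLOCK or offset % BLOCK:
        raise ValueError("length and offset must be multiples of BLOCK")
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, length)          # anonymous mappings are page-aligned
    try:
        n = os.preadv(fd, [buf], offset)  # read straight into the aligned buffer
        return buf[:n]                    # copy out only the bytes actually read
    finally:
        buf.close()
        os.close(fd)
```

Each call is a synchronous round trip with no readahead, which is why this
interface mostly suits applications that do their own caching and issue many
requests in parallel.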

Cheers,
Trond




2005-10-26 21:23:03

by Peter Staubach

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

Trond Myklebust wrote:

>On Wed, 26.10.2005 at 15:53 (-0400), Peter Staubach wrote:
>
>
>>This brings lots of extra guarantees, actually. Just because the file is
>>open for writing does not mean that there are any dirty pages hanging
>>around waiting to be written. And, even if there are, they will get
>>flushed when the conflict is detected. Last one there wins. This
>>is even the policy when local processes conflict on the same file in the
>>same region.
>>
>>This policy would address the situation that was reported here.
>>
>>This policy will definitely result in _much_ stronger caching semantics
>>than does close-to-open. These two policies together can usually result
>>in reasonable cache consistency, enough for most applications. Applications
>>which need stronger cache consistency should use advisory locking in order
>>to synchronize access to the file.
>>
>>
>
>Sure, but the big issue here is how to actually detect conflicts (and
>avoid excessive false positives).
>
>
>

I would say that it is better to be safe and then fast. Some cache
invalidations for false positives are better than missing some which
were required.

>NFSv3 does in theory give you the option of detecting conflicts using
>weak cache consistency. In practice, write reordering and the fact that
>most servers violate the requirement given by RFC1813 that pre/post-op
>attributes should be atomic w.r.t. the main operation prevents you from
>closing the hole.
>NFSv2 and NFSv4 don't even have support for WCC, so your detection
>scheme ends up being very dependent on one particular version of NFS.
>
>
>

Actually NFSv4 does have an attribute that the client can use, doesn't it?
Something like change_attr or some such?

The write reordering issue only exists for multiple concurrent operations
such as WRITE operations. I will agree, that if the wcc_data for WRITE
operations is used, then many false positives will probably occur. However,
useful and valid cache validations can be done using GETATTR or other
operations such as ACCESS or LOOKUP, even while a file is open for writing.
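
The attribute-based validation described here can be illustrated with a
user-space analogy. This Python sketch is not the kernel client's logic: the
CachedFile class is hypothetical, and it uses os.stat (which an NFS client
would typically satisfy from a GETATTR or its attribute cache) to compare
modification time and size against the values seen when the data was cached.

```python
import os


class CachedFile:
    """Toy data cache that is revalidated by comparing file attributes."""

    def __init__(self, path):
        self.path = path
        self.attrs = None   # (mtime_ns, size) recorded when data was cached
        self.data = None

    def _current_attrs(self):
        st = os.stat(self.path)
        return (st.st_mtime_ns, st.st_size)

    def read(self):
        attrs = self._current_attrs()
        if attrs != self.attrs:
            # Attributes changed since we cached: drop the cache and refetch.
            with open(self.path, "rb") as f:
                self.data = f.read()
            self.attrs = attrs
        # The file can still change between the check above and this return,
        # so the result is only "close enough", never strictly consistent.
        return self.data
```

A change that leaves both the timestamp and the size untouched would go
unnoticed by this check, which is one reason such validation cannot promise
that stale data is never returned.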

>Basically, what I'm saying is that as long as we cannot implement the
>above ideal, we should not be issuing promises to application developers
>that they can rely on it. O_DIRECT was specifically developed in order
>to give database implementers a reliable uncached I/O interface, and so
>that is what we should direct them towards.
>The worst thing to do when someone asks IMHO is to reply that "we can
>almost but not quite fix noac".
>

O_DIRECT is pretty much only useful to the database folks because of the
lack of readahead and write behind which kills performance. They can
utilize O_DIRECT because they use multiple contexts or AIO to issue the
i/o requests.

The application developers are already aware of the loose cache consistency
that NFS offers. This is not a reason to loosen it further though. We
can and should do the best job that we can. We have to make some
assumptions about how well NFS servers implement the correct semantics.
If an NFS server is truly broken, then let's get that NFS server fixed.
Avoiding useful semantics because some servers in the market may not get
them right seems self-defeating to me and just furthers the myth that NFS
is not useful as a distributed file system.

Thanx...

ps



2005-10-26 21:58:26

by Trond Myklebust

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

On Wed, 26.10.2005 at 17:22 (-0400), Peter Staubach wrote:

> I would say that it is better to be safe and then fast. Some cache
> invalidations for false positives are better than missing some which
> were required.

That really does depend on the application. For many setups caching is
vital to ensure scalability on the server side. For instance, someone
running an HPC cluster for an animation studio may be willing to
tolerate the odd caching error if that means running more clients per
server.

> >NFSv2 and NFSv4 don't even have support for WCC, so your detection
> >scheme ends up being very dependent on one particular version of NFS.
> >
> >
> >
>
> Actually NFSv4 does have an attribute that the client can use, doesn't it?
> Something like change_attr or some such?

The NFSv4 change attribute may be returned atomically for some
operations (mainly those that modify a directory, such as CREATE,
OPEN,...).
Unfortunately it is not returned atomically for the case of the WRITE
operation.

> The write reordering issue only exists for multiple concurrent operations
> such as WRITE operations. I will agree, that if the wcc_data for WRITE
> operations is used, then many false positives will probably occur. However,
> useful and valid cache validations can be done using GETATTR or other
> operations such as ACCESS or LOOKUP, even while a file is open for writing.

Agreed, and I am willing to relax the current restrictions for those
cases (in fact I happen to be testing the patch for that today).
That will not, however, suffice to ensure that the cache on any given
NFS client will never return stale data when you set the "noac" mount
flag.

> >Basically, what I'm saying is that as long as we cannot implement the
> >above ideal, we should not be issuing promises to application developers
> >that they can rely on it. O_DIRECT was specifically developed in order
> >to give database implementers a reliable uncached I/O interface, and so
> >that is what we should direct them towards.
> >The worst thing to do when someone asks IMHO is to reply that "we can
> >almost but not quite fix noac".
> >
>
> O_DIRECT is pretty much only useful to the database folks because of the
> lack of readahead and write behind which kills performance. They can
> utilize O_DIRECT because they use multiple contexts or AIO to issue the
> i/o requests.

So who else out there really needs the ability to have multiple readers
and writers per file without using locking?
I'm not asking rhetorically... I actually do need to knock up a
presentation on this particular topic over the course of the next week.

> The application developers are already aware of the loose cache consistency
> that NFS offers. This is not a reason to loosen it further though. We
> can and should do the best job that we can. We have to make some
> assumptions about how well NFS servers implement the correct semantics.
> If an NFS server is truly broken, then let's get that NFS server fixed.
> Avoiding useful semantics because some servers in the market may not get
> them right seems self-defeating to me and just furthers the myth that NFS
> is not useful as a distributed file system.

I am not opposed to strengthening the NFS cache consistency. I just want
to ensure that we have clearly articulated rules that developers can
rely upon.

See my recent proposal for NFSv4 "byte range delegations" at the IETF
RFC website. If people want NFS to have full posix cache semantics, then
we can certainly add that capability. ;-)

Cheers,
Trond




2005-10-27 12:25:45

by Peter Staubach

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

Trond Myklebust wrote:

>On Wed, 26.10.2005 at 17:22 (-0400), Peter Staubach wrote:
>
>
>
>>I would say that it is better to be safe and then fast. Some cache
>>invalidations for false positives are better than missing some which
>>were required.
>>
>>
>
>That really does depend on the application. For many setups caching is
>vital to ensure scalability on the server side. For instance, someone
>running an HPC cluster for an animation studio may be willing to
>tolerate the odd caching error if that means running more clients per
>server.
>
>

True, although there are other mechanisms such as nocto and increasing the
attribute cache timeouts to reduce the over the wire traffic. I would
suspect that those folks would not knowingly agree to a system which allowed
known cache inconsistencies, unless they themselves had configured the
system to do so.


>
>
>>>NFSv2 and NFSv4 don't even have support for WCC, so your detection
>>>scheme ends up being very dependent on one particular version of NFS.
>>>
>>>
>>>
>>>
>>>
>>Actually NFSv4 does have an attribute that the client can use, doesn't it?
>>Something like change_attr or some such?
>>
>>
>
>The NFSv4 change attribute may be returned atomically for some
>operations (mainly those that modify a directory, such as CREATE,
>OPEN,...).
>Unfortunately it is not returned atomically for the case of the WRITE
>operation.
>
>
>

Unfortunate. I guess that we could hope for delegations with callback
support to provide the stronger consistency in certain situations and
configurations.

>>The write reordering issue only exists for multiple concurrent operations
>>such as WRITE operations. I will agree, that if the wcc_data for WRITE
>>operations is used, then many false positives will probably occur. However,
>>useful and valid cache validations can be done using GETATTR or other
>>operations such as ACCESS or LOOKUP, even while a file is open for writing.
>>
>>
>
>Agreed, and I am willing to relax the current restrictions for those
>cases (in fact I happen to be testing the patch for that today).
>That will not, however, suffice to ensure that the cache on any given
>NFS client will never return stale data when you set the "noac" mount
>flag.
>
>
>

Very true. There is no guarantee that the NFS client, except perhaps for
NFSv4, will be completely consistent. It can probably be made "close
enough" for many applications, but never totally consistent. There is
always that window in between looking to see if the file has changed and
then using the cached information.

>>>Basically, what I'm saying is that as long as we cannot implement the
>>>above ideal, we should not be issuing promises to application developers
>>>that they can rely on it. O_DIRECT was specifically developed in order
>>>to give database implementers a reliable uncached I/O interface, and so
>>>that is what we should direct them towards.
>>>The worst thing to do when someone asks IMHO is to reply that "we can
>>>almost but not quite fix noac".
>>>
>>>
>>>
>>O_DIRECT is pretty much only useful to the database folks because of the
>>lack of readahead and write behind which kills performance. They can
>>utilize O_DIRECT because they use multiple contexts or AIO to issue the
>>i/o requests.
>>
>>
>
>So who else out there really needs the ability to have multiple readers
>and writers per file without using locking?
>I'm not asking rhetorically... I actually do need to knock up a
>presentation on this particular topic over the course of the next week.
>
>
>

Read-ahead and write-behind are the most common cases that I
can think of. It is the write-behind case which causes the most potential
"false positive" cases in NFS implementations that I have seen.

These are, of course, not visible to the application.

I guess that there are always things like log files, with append mode
writes, but I don't think that anyone has really figured out how to make
append mode work right yet.

>>The application developers are already aware of the loose cache consistency
>>that NFS offers. This is not a reason to loosen it further though. We
>>can and should do the best job that we can. We have to make some
>>assumptions about how well NFS servers implement the correct semantics.
>>If an NFS server is truly broken, then let's get that NFS server fixed.
>>Avoiding useful semantics because some servers in the market may not get
>>them right seems self-defeating to me and just furthers the myth that NFS
>>is not useful as a distributed file system.
>>
>>
>
>I am not opposed to strengthening the NFS cache consistency. I just want
>to ensure that we have clearly articulated rules that developers can
>rely upon.
>
>See my recent proposal for NFSv4 "byte range delegations" at the IETF
>RFC website. If people want NFS to have full posix cache semantics, then
>we can certainly add that capability. ;-)
>

I'll have to take a closer look. Is this a performance thing over the
normal whole file delegations?

Thanx...

ps



2005-10-27 12:53:43

by Trond Myklebust

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

On Thu, 27.10.2005 at 08:25 (-0400), Peter Staubach wrote:
> >See my recent proposal for NFSv4 "byte range delegations" at the IETF
> >RFC website. If people want NFS to have full posix cache semantics, then
> >we can certainly add that capability. ;-)
> >
>
> I'll have to take a closer look. Is this a performance thing over the
> normal whole file delegations?

No. It is dealing with caching in an environment where you may have
multiple readers and writers to the same file. In such an
environment, the file delegation model cannot function at all.

Basically the proposal adds support for safe write-behind and read
caching by allowing the client to set up a lease on ranges of cached
data. The server is given the ability to notify the client whenever it
needs to write back the dirty data and/or clear its read cache.
See
http://www.ietf.org/internet-drafts/draft-myklebust-nfsv4-byte-range-delegations-00.txt

Cheers,
Trond




2005-10-27 14:25:57

by Calum Mackay

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

Trond Myklebust wrote:
>> Actually NFSv4 does have an attribute that the client can use, doesn't it?
>> Something like change_attr or some such?
>
> The NFSv4 change attribute may be returned atomically for some
> operations (mainly those that modify a directory, such as CREATE,
> OPEN,...).
> Unfortunately it is not returned atomically for the case of the WRITE
> operation.

I imagine that's for performance reasons? Do we have any experience as
to what sort of difference this makes, in practice?

cheers,
calum.



2005-10-27 15:34:06

by Trond Myklebust

Subject: Re: Data coherency trouble with multiple clients, on 2.6.14-rc5

On Thu, 27.10.2005 at 15:25 (+0100), Calum Mackay wrote:
> > The NFSv4 change attribute may be returned atomically for some
> > operations (mainly those that modify a directory, such as CREATE,
> > OPEN,...).
> > Unfortunately it is not returned atomically for the case of the WRITE
> > operation.
>
> I imagine that's for performance reasons? Do we have any experience as
> to what sort of difference this makes, in practice?

If your clients have to collect the pre- and post-op change attribute
using GETATTR ops before and after the WRITE op in the "write" COMPOUND,
then there can be a fairly large race.
Most servers will serialise the WRITE ops w.r.t. each other, so a
typical race would be 2 server threads processing simultaneous write
requests to the same file from 2 different clients:

Client 1                        Client 2
--------                        --------
OP_GETATTR                      OP_GETATTR
                                OP_WRITE (bumps change attr)
OP_WRITE (bumps change attr)    OP_GETATTR
OP_GETATTR

When this happens, client 1 and client 2 both see the correct pre-op
change attribute, but client 1 just sees the post-op change attribute
that results from _both_ writes (client 2 may or may not do the same).
It will therefore miss the fact that its cache is now invalid.
Probably a rare occurrence in general, but definitely a problem if you
are writing a database application.

The same race could not occur with NFSv3's weak cache consistency
(assuming the servers implement it according to spec).
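
The race above can be replayed with a small simulation. This Python sketch
makes several assumptions for illustration: the Server class is a toy model,
the change attribute is modelled as a counter that increases by one per write
(the real attribute is only required to change), and client 2's trailing
GETATTR is placed after both writes. The write_wcc method models an
NFSv3-style WRITE whose pre- and post-op attributes are atomic with the
operation.

```python
class Server:
    """Toy server holding a single file's change attribute."""

    def __init__(self):
        self.change = 0

    def getattr(self):
        return self.change

    def write(self):
        self.change += 1            # every WRITE bumps the change attribute

    def write_wcc(self):
        """WRITE returning (pre, post) attributes atomically with the write."""
        pre = self.change
        self.change += 1
        return pre, self.change


def detects_foreign_write(cached, pre):
    """A client notices another writer only if pre-op differs from its cache."""
    return pre != cached


def v4_interleaving():
    """GETATTR / WRITE / GETATTR from two clients, interleaved as in the race."""
    s = Server()
    pre1 = s.getattr()              # client 1
    pre2 = s.getattr()              # client 2
    s.write()                       # client 2 WRITE
    s.write()                       # client 1 WRITE
    post2 = s.getattr()             # client 2
    post1 = s.getattr()             # client 1
    return (pre1, post1), (pre2, post2)


def wcc_interleaving():
    """The same two writes, each carrying atomic pre/post attributes."""
    s = Server()
    w2 = s.write_wcc()              # client 2
    w1 = s.write_wcc()              # client 1
    return w1, w2
```

Starting from a cached value of 0, v4_interleaving() gives client 1 the pair
(0, 2): the pre-op value matches its cache, so it adopts 2 and never learns
that one of the two bumps was not its own. In wcc_interleaving() client 1
gets (1, 2), and the pre-op mismatch against its cached 0 exposes the other
writer.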

Cheers,
Trond


