2020-06-08 21:20:31

by J. Bruce Fields

[permalink] [raw]
Subject: client caching and locks

What does the client do to its cache when it writes to a locked range?

The RFC:

https://tools.ietf.org/html/rfc7530#section-10.3.2

seems to imply that you should get something like local-filesystem
semantics if you write-lock any range that you write to and read-lock
any range that you read from.
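
Concretely, I mean a discipline something like the following (untested,
just to illustrate; the offsets and the helper name are made up):

#include <fcntl.h>
#include <unistd.h>

/* Write-lock the byte range we're about to write, write it, then unlock.
 * The read side would be the same with F_RDLCK around pread(). */
static int locked_write(int fd, off_t off, const void *buf, size_t len)
{
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = off,
		.l_len    = len,
	};
	ssize_t n;

	if (fcntl(fd, F_SETLKW, &fl) == -1)	/* take the write lock */
		return -1;
	n = pwrite(fd, buf, len, off);		/* write within the locked range */
	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);		/* release the lock */
	return n == (ssize_t)len ? 0 : -1;
}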

But I see a report that when applications write to non-overlapping
ranges (while taking locks over those ranges), they don't see each
other's updates.

I think for simultaneous non-overlapping writes to work that way, the
client would need to invalidate its cache on unlock (except for the
locked range). But I can't tell what the client's designed to do.

--b.


2020-06-18 10:09:16

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

> What does the client do to its cache when it writes to a locked range?
>
> The RFC:
>
> https://tools.ietf.org/html/rfc7530#section-10.3.2
>
> seems to apply that you should get something like local-filesystem
> semantics if you write-lock any range that you write to and read-lock
> any range that you read from.
>
> But I see a report that when applications write to non-overlapping
> ranges (while taking locks over those ranges), they don't see each
> other's updates.
>
> I think for simultaneous non-overlapping writes to work that way, the
> client would need to invalidate its cache on unlock (except for the
> locked range). But i can't tell what the client's designed to do.

Simultaneous non-overlapping WRITEs are not taken into consideration in RFC7530.
I personally think it is not necessary to deal with this case by modifying the kernel, because
the application on the client can be implemented to avoid it.

Serializing the simultaneous operations may be one way to do this.
Just before the write operation, each client locks and reads the overlapping range of data
instead of taking a lock only on its own non-overlapping range.
In that case the clients see each other's updates.
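
For example, something like this (an untested sketch; the offsets and
sizes are just for illustration, two clients sharing an 8-byte range):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Each client write-locks the whole shared range and reads it before
 * writing its own 4 bytes, so it also picks up the other client's data.
 * Error handling is omitted for brevity. */
static int serialized_update(int fd, off_t my_off, int32_t my_val)
{
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 8,		/* the whole range both clients touch */
	};
	int32_t shared[2];

	if (fcntl(fd, F_SETLKW, &fl) == -1)
		return -1;
	pread(fd, shared, sizeof(shared), 0);		/* read the shared range while holding the lock */
	pwrite(fd, &my_val, sizeof(my_val), my_off);	/* update only my own offset */
	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);
	return 0;
}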

Yuki Inoguchi

>
> --b.

2020-06-18 14:31:25

by Trond Myklebust

[permalink] [raw]
Subject: Re: client caching and locks

On Thu, 2020-06-18 at 09:54 +0000, [email protected] wrote:
> > What does the client do to its cache when it writes to a locked
> > range?
> >
> > The RFC:
> >
> > https://tools.ietf.org/html/rfc7530#section-10.3.2
> >
> > seems to apply that you should get something like local-filesystem
> > semantics if you write-lock any range that you write to and read-
> > lock
> > any range that you read from.
> >
> > But I see a report that when applications write to non-overlapping
> > ranges (while taking locks over those ranges), they don't see each
> > other's updates.
> >
> > I think for simultaneous non-overlapping writes to work that way,
> > the
> > client would need to invalidate its cache on unlock (except for the
> > locked range). But i can't tell what the client's designed to do.
>
> Simultaneous non-overlapping WRITEs is not taken into consideration
> in RFC7530.
> I personally think it is not necessary to deal with this case by
> modifying the kernel because
> the application on the client can be implemented to avoid it.
>
> Serialization of the simultaneous operations may be one of the ways.
> Just before the write operation, each client locks and reads the
> overlapped range of data
> instead of obtaining a lock in their own non-overlapping range.
> They can reflect updates from other clients in this case.
>
> Yuki Inoguchi
>
> > --b.

See the function 'fs/nfs/file.c:do_setlk()'. We flush dirty file data
both before and after taking the byte range lock. After taking the
lock, we force a revalidation of the data before returning control to
the application (unless there is a delegation that allows us to cache
more aggressively).

In addition, if you look at fs/nfs/file.c:do_unlk() you'll note that we
force a flush of all dirty file data before releasing the lock.

Finally, note that we turn off assumptions of close-to-open caching
semantics when we detect that the application is using locking, and we
turn off optimisations such as assuming we can extend writes to page
boundaries when the page is marked as being up to date.
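
Roughly, and glossing over details, the sequence is (a simplified
sketch, not the literal code in fs/nfs/file.c):

/* Simplified sketch of do_setlk()/do_unlk() per the description above;
 * the real code handles more cases. */
static int do_setlk_sketch(struct file *filp, int cmd, struct file_lock *fl)
{
	struct inode *inode = file_inode(filp);
	int status;

	nfs_sync_mapping(filp->f_mapping);		/* flush dirty data before locking */
	status = NFS_PROTO(inode)->lock(filp, cmd, fl);	/* take the lock on the server */
	if (status != 0)
		return status;
	nfs_sync_mapping(filp->f_mapping);		/* flush again after taking the lock */
	if (!NFS_PROTO(inode)->have_delegation(inode, FMODE_READ)) {
		nfs_zap_caches(inode);			/* force revalidation of cached data */
		nfs_revalidate_mapping(inode, filp->f_mapping);
	}
	return 0;
}

static int do_unlk_sketch(struct file *filp, int cmd, struct file_lock *fl)
{
	nfs_sync_mapping(filp->f_mapping);		/* flush dirty data before releasing the lock */
	return NFS_PROTO(file_inode(filp))->lock(filp, cmd, fl);
}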

IOW: if all the clients are running Linux, then the thread that took
the lock should see 100% up to date data in the locked range. I believe
most (if not all) non-Linux clients use similar semantics when
taking/releasing byte range locks, so they too should be fine.

The only 2 issues I can think of offhand that might blow things up are:

1. The client thinks it holds a delegation when it does not (e.g.
because the delegation was revoked) causing it to assume it can
cache aggressively.
2. The change attribute on the server implementation is based on a
ctime with crappy resolution that causes the client to believe the
data has not changed on the server even though it has (a.k.a. 'ext3
syndrome').

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2020-06-18 20:10:11

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Thu, Jun 18, 2020 at 02:29:42PM +0000, Trond Myklebust wrote:
> On Thu, 2020-06-18 at 09:54 +0000, [email protected] wrote:
> > > What does the client do to its cache when it writes to a locked
> > > range?
> > >
> > > The RFC:
> > >
> > > https://tools.ietf.org/html/rfc7530#section-10.3.2
> > >
> > > seems to apply that you should get something like local-filesystem
> > > semantics if you write-lock any range that you write to and read-
> > > lock
> > > any range that you read from.
> > >
> > > But I see a report that when applications write to non-overlapping
> > > ranges (while taking locks over those ranges), they don't see each
> > > other's updates.
> > >
> > > I think for simultaneous non-overlapping writes to work that way,
> > > the
> > > client would need to invalidate its cache on unlock (except for the
> > > locked range). But i can't tell what the client's designed to do.
> >
> > Simultaneous non-overlapping WRITEs is not taken into consideration
> > in RFC7530.
> > I personally think it is not necessary to deal with this case by
> > modifying the kernel because
> > the application on the client can be implemented to avoid it.
> >
> > Serialization of the simultaneous operations may be one of the ways.
> > Just before the write operation, each client locks and reads the
> > overlapped range of data
> > instead of obtaining a lock in their own non-overlapping range.
> > They can reflect updates from other clients in this case.
> >
> > Yuki Inoguchi
> >
> > > --b.
>
> See the function 'fs/nfs/file.c:do_setlk()'. We flush dirty file data
> both before and after taking the byte range lock. After taking the
> lock, we force a revalidation of the data before returning control to
> the application (unless there is a delegation that allows us to cache
> more aggressively).
>
> In addition, if you look at fs/nfs/file.c:do_unlk() you'll note that we
> force a flush of all dirty file data before releasing the lock.
>
> Finally, note that we turn off assumptions of close-to-open caching
> semantics when we detect that the application is using locking, and we
> turn off optimisations such as assuming we can extend writes to page
> boundaries when the page is marked as being up to date.
>
> IOW: if all the clients are running Linux, then the thread that took
> the lock should see 100% up to date data in the locked range. I believe
> most (if not all) non-Linux clients use similar semantics when
> taking/releasing byte range locks, so they too should be fine.

I probably don't understand the algorithm (in particular, how it
revalidates caches after a write).

How does it avoid a race like this?:

Start with a file whose data is all 0's and change attribute x:

client 0                                 client 1
--------                                 --------
take write lock on byte 0
                                         take write lock on byte 1
write 1 to offset 0
  change attribute now x+1
                                         write 1 to offset 1
                                           change attribute now x+2
getattr returns x+2
                                         getattr returns x+2
unlock
                                         unlock

take readlock on byte 1

At this point a getattr will return change attribute x+2, the same as
was returned after client 0's write. Does that mean client 0 assumes
the file data is unchanged since its last write?

--b.

2020-06-22 13:53:03

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Thu, Jun 18, 2020 at 04:09:05PM -0400, [email protected] wrote:
> I probably don't understand the algorithm (in particular, how it
> revalidates caches after a write).
>
> How does it avoid a race like this?:
>
> Start with a file whose data is all 0's and change attribute x:
>
> client 0 client 1
> -------- --------
> take write lock on byte 0
> take write lock on byte 1
> write 1 to offset 0
> change attribute now x+1
> write 1 to offset 1
> change attribute now x+2
> getattr returns x+2
> getattr returns x+2
> unlock
> unlock
>
> take readlock on byte 1
>
> At this point a getattr will return change attribute x+2, the same as
> was returned after client 0's write. Does that mean client 0 assumes
> the file data is unchanged since its last write?

Basically: write-locking less than the whole range doesn't prevent
concurrent writes outside that range. And the change attribute gives us
no way to identify whether concurrent writes have happened. (At least,
not without NFS4_CHANGE_TYPE_IS_VERSION_COUNTER.)

So as far as I can tell, a client implementation has no reliable way to
revalidate its cache outside the write-locked area--instead it needs to
just throw out that part of the cache.

Possibly that's what it's doing and I just don't see it--I read through
some of the code and don't understand it yet.

--b.

2020-10-01 21:48:39

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Mon, Jun 22, 2020 at 09:52:22AM -0400, [email protected] wrote:
> On Thu, Jun 18, 2020 at 04:09:05PM -0400, [email protected] wrote:
> > I probably don't understand the algorithm (in particular, how it
> > revalidates caches after a write).
> >
> > How does it avoid a race like this?:
> >
> > Start with a file whose data is all 0's and change attribute x:
> >
> > client 0 client 1
> > -------- --------
> > take write lock on byte 0
> > take write lock on byte 1
> > write 1 to offset 0
> > change attribute now x+1
> > write 1 to offset 1
> > change attribute now x+2
> > getattr returns x+2
> > getattr returns x+2
> > unlock
> > unlock
> >
> > take readlock on byte 1
> >
> > At this point a getattr will return change attribute x+2, the same as
> > was returned after client 0's write. Does that mean client 0 assumes
> > the file data is unchanged since its last write?
>
> Basically: write-locking less than the whole range doesn't prevent
> concurrent writes outside that range. And the change attribute gives us
> no way to identify whether concurrent writes have happened. (At least,
> not without NFS4_CHANGE_TYPE_IS_VERSION_COUNTER.)
>
> So as far as I can tell, a client implementation has no reliable way to
> revalidate its cache outside the write-locked area--instead it needs to
> just throw out that part of the cache.

Does my description of that race make sense?

--b.

2020-10-01 22:27:18

by Matt Benjamin

[permalink] [raw]
Subject: Re: client caching and locks

Hi Bruce,

I'm not sure. My understanding has been that, NFSv4 does not mandate
a mechanism to update clients of changes outside of any locked range.
In AFS (and I think DCE DFS?) this role is played by DataVersion. If
I recall correctly, David Noveck provided an errata that addresses
this, that servers could use in a similar manner to DV, but I don't
recall the details.

Matt

On Thu, Oct 1, 2020 at 5:48 PM [email protected]
<[email protected]> wrote:
>
> On Mon, Jun 22, 2020 at 09:52:22AM -0400, [email protected] wrote:
> > On Thu, Jun 18, 2020 at 04:09:05PM -0400, [email protected] wrote:
> > > I probably don't understand the algorithm (in particular, how it
> > > revalidates caches after a write).
> > >
> > > How does it avoid a race like this?:
> > >
> > > Start with a file whose data is all 0's and change attribute x:
> > >
> > > client 0 client 1
> > > -------- --------
> > > take write lock on byte 0
> > > take write lock on byte 1
> > > write 1 to offset 0
> > > change attribute now x+1
> > > write 1 to offset 1
> > > change attribute now x+2
> > > getattr returns x+2
> > > getattr returns x+2
> > > unlock
> > > unlock
> > >
> > > take readlock on byte 1
> > >
> > > At this point a getattr will return change attribute x+2, the same as
> > > was returned after client 0's write. Does that mean client 0 assumes
> > > the file data is unchanged since its last write?
> >
> > Basically: write-locking less than the whole range doesn't prevent
> > concurrent writes outside that range. And the change attribute gives us
> > no way to identify whether concurrent writes have happened. (At least,
> > not without NFS4_CHANGE_TYPE_IS_VERSION_COUNTER.)
> >
> > So as far as I can tell, a client implementation has no reliable way to
> > revalidate its cache outside the write-locked area--instead it needs to
> > just throw out that part of the cache.
>
> Does my description of that race make sense?
>
> --b.
>


--

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309

2020-10-06 17:26:55

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Thu, Oct 01, 2020 at 06:26:25PM -0400, Matt Benjamin wrote:
> I'm not sure. My understanding has been that, NFSv4 does not mandate
> a mechanism to update clients of changes outside of any locked range.
> In AFS (and I think DCE DFS?) this role is played by DataVersion. If
> I recall correctly, David Noveck provided an errata that addresses
> this, that servers could use in a similar manner to DV, but I don't
> recall the details.

Maybe you're thinking of the change_attr_type that's new to 4.2? I
think that was Trond's proposal originally. In the
CHANGE_TYPE_IS_VERSION_COUNTER case it would in theory allow you to tell
whether a file that you'd written to was also written to by someone else
by counting WRITE operations.
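
(Hypothetically, the client-side check could then be as simple as the
following--the helper name is made up, just to show the arithmetic:)

#include <stdint.h>

/* Assuming change_attr_type == NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, the
 * change attribute advances by one per modification, so if it advanced
 * by more than the number of WRITEs we sent ourselves, someone else
 * must have written too. */
static int someone_else_wrote(uint64_t change_before, uint64_t change_after,
			      uint64_t my_writes)
{
	return (change_after - change_before) > my_writes;
}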

But we still have to ensure consistency whether or not the server implements
that. (I doubt any server currently does.)

--b.

>
> Matt
>
> On Thu, Oct 1, 2020 at 5:48 PM [email protected]
> <[email protected]> wrote:
> >
> > On Mon, Jun 22, 2020 at 09:52:22AM -0400, [email protected] wrote:
> > > On Thu, Jun 18, 2020 at 04:09:05PM -0400, [email protected] wrote:
> > > > I probably don't understand the algorithm (in particular, how it
> > > > revalidates caches after a write).
> > > >
> > > > How does it avoid a race like this?:
> > > >
> > > > Start with a file whose data is all 0's and change attribute x:
> > > >
> > > > client 0 client 1
> > > > -------- --------
> > > > take write lock on byte 0
> > > > take write lock on byte 1
> > > > write 1 to offset 0
> > > > change attribute now x+1
> > > > write 1 to offset 1
> > > > change attribute now x+2
> > > > getattr returns x+2
> > > > getattr returns x+2
> > > > unlock
> > > > unlock
> > > >
> > > > take readlock on byte 1
> > > >
> > > > At this point a getattr will return change attribute x+2, the same as
> > > > was returned after client 0's write. Does that mean client 0 assumes
> > > > the file data is unchanged since its last write?
> > >
> > > Basically: write-locking less than the whole range doesn't prevent
> > > concurrent writes outside that range. And the change attribute gives us
> > > no way to identify whether concurrent writes have happened. (At least,
> > > not without NFS4_CHANGE_TYPE_IS_VERSION_COUNTER.)
> > >
> > > So as far as I can tell, a client implementation has no reliable way to
> > > revalidate its cache outside the write-locked area--instead it needs to
> > > just throw out that part of the cache.
> >
> > Does my description of that race make sense?
> >
> > --b.
> >
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-821-5101
> fax. 734-769-8938
> cel. 734-216-5309

2021-12-28 02:41:09

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

Hi,

Sorry to resurrect this old thread, but I wonder how NFS clients should behave.

I'm seeing this behavior when I run a test program using Open MPI. In the test program,
two clients acquire locks at each write location. Then they simultaneously write
data to the same file in NFS.

In other words, the program does just like Bruce explained previously:

> > > > > client 0 client 1
> > > > > -------- --------
> > > > > take write lock on byte 0
> > > > > take write lock on byte 1
> > > > > write 1 to offset 0
> > > > > change attribute now x+1
> > > > > write 1 to offset 1
> > > > > change attribute now x+2
> > > > > getattr returns x+2
> > > > > getattr returns x+2
> > > > > unlock
> > > > > unlock
> > > > >
> > > > > take readlock on byte 1

In my test,
- The file data is zero-filled before the write.
- client 0 acquires a write lock at offset 0 and writes 1 to it.
- client 1 acquires a write lock at offset 4 and writes 2 to it.

After the test, sometimes I'm seeing the following result. A client doesn't reflect the other's update.
-----
- client 0:
[user@client0 nfs]$ od -i data
0000000 1 2
0000010

- client 1:
[user@client1 nfs]# od -i data
0000000 0 2
0000010

- NFS server:
[user@server nfs]# od -i data
0000000 1 2
0000010
-----

This happens because client 1 receives the GETATTR reply after both clients' writes have completed.
Therefore, client 1 assumes the file data is unchanged since its last write.

For details, please see the following analysis of a tcpdump capture collected on the NFS server:
-----
IP addresses are as follows:
- client 0: 192.168.122.158
- client 1: 192.168.122.244
- server: 192.168.122.12

1. client 0 and 1 called OPEN, LOCK and WRITE to write values on each offset of the file simultaneously:

165587 2021-12-27 19:08:26.792438 192.168.122.244 → 192.168.122.12 NFS 354 V4 Call OPEN DH: 0xc1b3a552/data
165589 2021-12-27 19:08:26.801025 192.168.122.12 → 192.168.122.244 NFS 430 V4 Reply (Call In 165587) OPEN StateID: 0x9357
165592 2021-12-27 19:08:26.802125 192.168.122.158 → 192.168.122.12 NFS 322 V4 Call OPEN DH: 0xc1b3a552/data
165593 2021-12-27 19:08:26.802367 192.168.122.12 → 192.168.122.158 NFS 438 V4 Reply (Call In 165592) OPEN StateID: 0xde4c
165595 2021-12-27 19:08:26.807853 192.168.122.158 → 192.168.122.12 NFS 326 V4 Call LOCK FH: 0x4cdb3daa Offset: 0 Length: 4
165596 2021-12-27 19:08:26.807879 192.168.122.244 → 192.168.122.12 NFS 326 V4 Call LOCK FH: 0x4cdb3daa Offset: 4 Length: 4
165597 2021-12-27 19:08:26.807983 192.168.122.12 → 192.168.122.158 NFS 182 V4 Reply (Call In 165595) LOCK
165598 2021-12-27 19:08:26.808047 192.168.122.12 → 192.168.122.244 NFS 182 V4 Reply (Call In 165596) LOCK
165600 2021-12-27 19:08:26.808473 192.168.122.158 → 192.168.122.12 NFS 294 V4 Call WRITE StateID: 0x8cc0 Offset: 0 Len: 4
165602 2021-12-27 19:08:26.809058 192.168.122.244 → 192.168.122.12 NFS 294 V4 Call WRITE StateID: 0x8a41 Offset: 4 Len: 4

2. client 0 received WRITE reply earlier than client 1 so it called LOCKU and CLOSE.

165607 2021-12-27 19:08:26.843312 192.168.122.12 → 192.168.122.158 NFS 334 V4 Reply (Call In 165600) WRITE
165608 2021-12-27 19:08:26.844218 192.168.122.158 → 192.168.122.12 NFS 282 V4 Call LOCKU FH: 0x4cdb3daa Offset: 0 Length: 4
165609 2021-12-27 19:08:26.844320 192.168.122.12 → 192.168.122.158 NFS 182 V4 Reply (Call In 165608) LOCKU
165611 2021-12-27 19:08:26.845007 192.168.122.158 → 192.168.122.12 NFS 278 V4 Call CLOSE StateID: 0xde4c
165613 2021-12-27 19:08:26.845230 192.168.122.12 → 192.168.122.158 NFS 334 V4 Reply (Call In 165611) CLOSE

At the frame 165607, the file's changeid was 1761580521582393257.

# tshark -r repro.cap -V "frame.number==165607" | grep -e "changeid:" -e Ops
Network File System, Ops(4): SEQUENCE PUTFH WRITE GETATTR
changeid: 1761580521582393257

3. client 0 called OPEN again while client 1 was still waiting for WRITE reply.

165615 2021-12-27 19:08:26.845652 192.168.122.158 → 192.168.122.12 NFS 322 V4 Call OPEN DH: 0x4cdb3daa/
165616 2021-12-27 19:08:26.847702 192.168.122.12 → 192.168.122.158 NFS 386 V4 Reply (Call In 165615) OPEN StateID: 0x95d6

At the frame 165616, the file's changeid was incremented to 1761580521582393258 by the server.

# tshark -r repro.cap -V "frame.number==165616" | grep -e "changeid:" -e Ops
Network File System, Ops(5): SEQUENCE PUTFH OPEN ACCESS GETATTR
changeid: 1761580521582393258

Therefore, client 0 called READ and picked up the other client's update.

165617 2021-12-27 19:08:26.848454 192.168.122.158 → 192.168.122.12 NFS 270 V4 Call READ StateID: 0x907b Offset: 0 Len: 8
165618 2021-12-27 19:08:26.848572 192.168.122.12 → 192.168.122.158 NFS 182 V4 Reply (Call In 165617) READ
165619 2021-12-27 19:08:26.849096 192.168.122.158 → 192.168.122.12 NFS 278 V4 Call CLOSE StateID: 0x95d6
165620 2021-12-27 19:08:26.849179 192.168.122.12 → 192.168.122.158 NFS 334 V4 Reply (Call In 165619) CLOSE

4. client 1 finally received WRITE reply and called LOCKU and CLOSE.

165622 2021-12-27 19:08:26.855130 192.168.122.12 → 192.168.122.244 NFS 334 V4 Reply (Call In 165602) WRITE
165623 2021-12-27 19:08:26.855965 192.168.122.244 → 192.168.122.12 NFS 282 V4 Call LOCKU FH: 0x4cdb3daa Offset: 4 Length: 4
165625 2021-12-27 19:08:26.856094 192.168.122.12 → 192.168.122.244 NFS 182 V4 Reply (Call In 165623) LOCKU
165627 2021-12-27 19:08:26.856647 192.168.122.244 → 192.168.122.12 NFS 278 V4 Call CLOSE StateID: 0x9357
165629 2021-12-27 19:08:26.856784 192.168.122.12 → 192.168.122.244 NFS 334 V4 Reply (Call In 165627) CLOSE

At the frame 165622, changeid was _not_ incremented.

# tshark -r repro.cap -V "frame.number==165622" | grep -e "changeid:" -e Ops
Network File System, Ops(4): SEQUENCE PUTFH WRITE GETATTR
changeid: 1761580521582393258

5. client 1 called OPEN again but ...

165635 2021-12-27 19:08:26.858006 192.168.122.244 → 192.168.122.12 NFS 322 V4 Call OPEN DH: 0x4cdb3daa/
165636 2021-12-27 19:08:26.859538 192.168.122.12 → 192.168.122.244 NFS 386 V4 Reply (Call In 165635) OPEN StateID: 0x3d15

... since no further changes were made to the file, the changeid wasn't updated.

# tshark -r repro.cap -V "frame.number==165636" | grep -e "changeid:" -e Ops
Network File System, Ops(5): SEQUENCE PUTFH OPEN ACCESS GETATTR
changeid: 1761580521582393258

6. client 1 assumed the file data was unchanged since its last write. It called CLOSE without calling READ.

165637 2021-12-27 19:08:26.860201 192.168.122.244 → 192.168.122.12 NFS 278 V4 Call CLOSE StateID: 0x3d15
165638 2021-12-27 19:08:26.860296 192.168.122.12 → 192.168.122.244 NFS 334 V4 Reply (Call In 165637) CLOSE
-----

So the current NFS client implementation doesn't guarantee that a client sees other clients' updates
when the file is written simultaneously in non-overlapping ranges. Because that isn't guaranteed
in RFC 7530 either, I think this is not a bug.

However, if it were to be implemented, what approaches would be feasible?

If a client invalidates the whole cache of the file on unlock, each client sees the others' updates
written in non-overlapping ranges. I verified it with the following change in
fs/nfs/file.c:do_unlk().
----------
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -746,6 +746,14 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 		status = NFS_PROTO(inode)->lock(filp, cmd, fl);
 	else
 		status = locks_lock_file_wait(filp, fl);
+
+	nfs_sync_mapping(filp->f_mapping);
+	if (!NFS_PROTO(inode)->have_delegation(inode, FMODE_READ)) {
+		nfs_zap_caches(inode);
+		if (mapping_mapped(filp->f_mapping))
+			nfs_revalidate_mapping(inode, filp->f_mapping);
+	}
+
 	return status;
 }
----------
But I feel this approach is redundant since the client already invalidates its cache on lock in fs/nfs/file.c:do_setlk().
Also, the above change may cause performance degradation. Are there any other approaches
we can take? Or does NFS not need to implement this because it's not a bug?

Yuki Inoguchi

> -----Original Message-----
> From: [email protected] <[email protected]>
> Sent: Wednesday, October 7, 2020 2:26 AM
> To: Matt Benjamin <[email protected]>
> Cc: Trond Myklebust <[email protected]>; Inoguchi, Yuki/井ノ口 雄生
> <[email protected]>; [email protected]
> Subject: Re: client caching and locks
>
> On Thu, Oct 01, 2020 at 06:26:25PM -0400, Matt Benjamin wrote:
> > I'm not sure. My understanding has been that, NFSv4 does not mandate
> > a mechanism to update clients of changes outside of any locked range.
> > In AFS (and I think DCE DFS?) this role is played by DataVersion. If
> > I recall correctly, David Noveck provided an errata that addresses
> > this, that servers could use in a similar manner to DV, but I don't
> > recall the details.
>
> Maybe you're thinking of the change_attr_type that's new to 4.2? I
> think that was Trond's proposal originally. In the
> CHANGE_TYPE_IS_VERSION_COUNTER case it would in theory allow you to
> tell
> whether a file that you'd written to was also written to by someone else
> by counting WRITE operations.
>
> But we still have to ensure consistency whether the server implements
> that. (I doubt any server currently does.)
>
> --b.
>
> >
> > Matt
> >
> > On Thu, Oct 1, 2020 at 5:48 PM [email protected]
> > <[email protected]> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 09:52:22AM -0400, [email protected] wrote:
> > > > On Thu, Jun 18, 2020 at 04:09:05PM -0400, [email protected] wrote:
> > > > > I probably don't understand the algorithm (in particular, how it
> > > > > revalidates caches after a write).
> > > > >
> > > > > How does it avoid a race like this?:
> > > > >
> > > > > Start with a file whose data is all 0's and change attribute x:
> > > > >
> > > > > client 0 client 1
> > > > > -------- --------
> > > > > take write lock on byte 0
> > > > > take write lock on byte 1
> > > > > write 1 to offset 0
> > > > > change attribute now x+1
> > > > > write 1 to offset 1
> > > > > change attribute now
> x+2
> > > > > getattr returns x+2
> > > > > getattr returns x+2
> > > > > unlock
> > > > > unlock
> > > > >
> > > > > take readlock on byte 1
> > > > >
> > > > > At this point a getattr will return change attribute x+2, the same as
> > > > > was returned after client 0's write. Does that mean client 0 assumes
> > > > > the file data is unchanged since its last write?
> > > >
> > > > Basically: write-locking less than the whole range doesn't prevent
> > > > concurrent writes outside that range. And the change attribute gives us
> > > > no way to identify whether concurrent writes have happened. (At least,
> > > > not without NFS4_CHANGE_TYPE_IS_VERSION_COUNTER.)
> > > >
> > > > So as far as I can tell, a client implementation has no reliable way to
> > > > revalidate its cache outside the write-locked area--instead it needs to
> > > > just throw out that part of the cache.
> > >
> > > Does my description of that race make sense?
> > >
> > > --b.
> > >
> >
> >
> > --
> >
> > Matt Benjamin
> > Red Hat, Inc.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> >
> > http://www.redhat.com/en/technologies/storage
> >
> > tel. 734-821-5101
> > fax. 734-769-8938
> > cel. 734-216-5309

2021-12-28 05:11:59

by NeilBrown

[permalink] [raw]
Subject: RE: client caching and locks

On Tue, 28 Dec 2021, [email protected] wrote:
> Hi,
>
> Sorry to resurrect this old thread, but I wonder how NFS clients should behave.
>
> I'm seeing this behavior when I run a test program using Open MPI. In the test program,
> two clients acquire locks at each write location. Then they simultaneously write
> data to the same file in NFS.
>
> In other words, the program does just like Bruce explained previously:
>
> > > > > > client 0 client 1
> > > > > > -------- --------
> > > > > > take write lock on byte 0
> > > > > > take write lock on byte 1
> > > > > > write 1 to offset 0
> > > > > > change attribute now x+1
> > > > > > write 1 to offset 1
> > > > > > change attribute now x+2
> > > > > > getattr returns x+2
> > > > > > getattr returns x+2
> > > > > > unlock
> > > > > > unlock
> > > > > >
> > > > > > take readlock on byte 1
>
> In my test,
> - The file data is zero-filled before the write.
> - client 0 acquires a write lock at offset 0 and writes 1 to it.
> - client 1 acquires a write lock at offset 4 and writes 2 to it.
>
> After the test, sometimes I'm seeing the following result. A client doesn't reflect the other's update.
> -----
> - client 0:
> [user@client0 nfs]$ od -i data
> 0000000 1 2
> 0000010
>
> - client 1:
> [user@client1 nfs]# od -i data
> 0000000 0 2
> 0000010
>
> - NFS server:
> [user@server nfs]# od -i data
> 0000000 1 2
> 0000010
> -----
>
> This happens because client 1 receives GETATTR reply after both clients' writes completed.
> Therefore, client 1 assumes the file data is unchanged since its last write.

This is due to an (arguable) weakness in the NFSv4 protocol.
In NFSv3 the reply to the WRITE request had "wcc" data which would
report change information before and after the write and, if present, it
was required to be atomic. So, provided timestamps had a high
resolution, client0 would see change information from *before* the
write from client1 completed. So it would know it needed to refresh
after that write became visible.
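
Something like this, just to illustrate (these are not the real
structures or helpers, and v3 uses timestamps rather than a counter):

#include <stdint.h>

/* With wcc data the WRITE reply carries the change information from
 * before and after the write, captured atomically.  If the "before"
 * value still matches what the client had cached, nothing else touched
 * the file in the meantime and the cache stays valid; otherwise the
 * client knows it must refresh. */
struct wcc_sketch {
	uint64_t before_change;	/* server state just before this WRITE */
	uint64_t after_change;	/* server state just after this WRITE  */
};

static int cache_still_valid(const struct wcc_sketch *wcc,
			     uint64_t *cached_change)
{
	if (wcc->before_change == *cached_change) {
		*cached_change = wcc->after_change;
		return 1;		/* no intervening write */
	}
	return 0;			/* someone else wrote: need to refresh */
}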

With NFSv4 there are no atomicity guarantees relating to writes and
changeid.
There is provision for atomicity around directory operations, but not
around data operations.

This means that if different clients access a file concurrently, then
their caches can become incorrect. The only way to ensure uncorrupted
data is to use locking for ALL reads and writes. The above 'od -i' does
not perform a locked read, so it can give incorrect data.
If you take a whole-file lock before reading, then you should get correct
data.
You could argue that this requirement (always lock if there is any risk)
is by design, and so this aspect of the protocol is not a weakness.

NeilBrown

2022-01-03 16:20:45

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Tue, Dec 28, 2021 at 04:11:51PM +1100, NeilBrown wrote:
> This is due to an (arguable) weakness in the NFSv4 protocol.
> In NFSv3 the reply to the WRITE request had "wcc" data which would
> report change information before and after the write and, if present, it
> was required to be atomic. So, providing timestamps had a high
> resolution, the client0 would see change information from *before* the
> write from client1 completed. So it would know it needed to refresh
> after that write became visible.
>
> With NFSv4 there is no atomicity guarantees relating to writes and
> changeid.
> There is provision for atomicity around directory operations, but not
> around data operations.
>
> This means that if different clients access a file concurrently, then
> their cache can become incorrect. The only way to ensure uncorrupted
> data is to use locking for ALL reads and writes. The above 'od -i' does
> not perform a locked read, so can give incorrect data.
> If you got a whole-file lock before reading, then you should get correct
> data.

You'd also have to get a whole-file write lock on every write, wouldn't
you, to prevent your own write from obscuring the change-attribute
update caused by a concurrent writer?

> You could argue that this requirement (always lock if there is any risk)
> is by design, and so this aspect of the protocl is not a weakness.

The spec discussion of byte-range locking and caching is here:
https://datatracker.ietf.org/doc/html/rfc7530#section-10.3.2

The nfs man page, under ac/noac, says "Using the noac option provides
greater cache coherence among NFS clients accessing the same files,
but it extracts a significant performance penalty. As such, judicious
use of file locking is encouraged instead. The DATA AND METADATA
COHERENCE section contains a detailed discussion of these trade-offs."

That section does have a "Using file locks with NFS" subsection, but
that subsection doesn't actually discuss the interaction of file locks
with client caching.

It's a confusing and under-documented area.

--b.

2022-01-04 09:33:33

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

Hi Neil and Bruce,

Thank you for your comments.
Now I have understood that this behavior is by design.

> > With NFSv4 there is no atomicity guarantees relating to writes and
> > changeid.
> > There is provision for atomicity around directory operations, but not
> > around data operations.

So I feel like clients cannot always trust changeid in NFSv4.
Should it be described in the spec?

For example, the following section refers to the usage of changeid:
https://datatracker.ietf.org/doc/html/draft-dnoveck-nfsv4-rfc5661bis#section-14.3.1

It says clients use the change attribute "to ensure that the data for the OPENed file is still
correctly reflected in the client's cache." But in fact, it could be wrong.

Yuki Inoguchi

> -----Original Message-----
> From: '[email protected]' <[email protected]>
> Sent: Tuesday, January 4, 2022 1:21 AM
> To: NeilBrown <[email protected]>
> Cc: Inoguchi, Yuki/井ノ口 雄生 <[email protected]>; 'Matt Benjamin'
> <[email protected]>; 'Trond Myklebust' <[email protected]>;
> '[email protected]' <[email protected]>
> Subject: Re: client caching and locks
>
> On Tue, Dec 28, 2021 at 04:11:51PM +1100, NeilBrown wrote:
> > This is due to an (arguable) weakness in the NFSv4 protocol.
> > In NFSv3 the reply to the WRITE request had "wcc" data which would
> > report change information before and after the write and, if present, it
> > was required to be atomic. So, providing timestamps had a high
> > resolution, the client0 would see change information from *before* the
> > write from client1 completed. So it would know it needed to refresh
> > after that write became visible.
> >
> > With NFSv4 there is no atomicity guarantees relating to writes and
> > changeid.
> > There is provision for atomicity around directory operations, but not
> > around data operations.
> >
> > This means that if different clients access a file concurrently, then
> > their cache can become incorrect. The only way to ensure uncorrupted
> > data is to use locking for ALL reads and writes. The above 'od -i' does
> > not perform a locked read, so can give incorrect data.
> > If you got a whole-file lock before reading, then you should get correct
> > data.
>
> You'd also have to get a whole-file write lock on every write, wouldn't
> you, to prevent your own write from obscuring the change-attribute
> update caused by a concurrent writer?
>
> > You could argue that this requirement (always lock if there is any risk)
> > is by design, and so this aspect of the protocl is not a weakness.
>
> The spec discussion of byte-range locking and caching is here:
> https://datatracker.ietf.org/doc/html/rfc7530#section-10.3.2
>
> The nfs man page, under ac/noac, says "Using the noac option provides
> greater cache coherence among NFS clients accessing the same files,
> but it extracts a significant performance penalty. As such, judicious
> use of file locking is encouraged instead. The DATA AND METADATA
> COHERENCE section contains a detailed discussion of these trade-offs."
>
> That section does have a "Using file locks with NFS" subsection, but
> that subsection doesn't actually discuss the interaction of file locks
> with client caching.
>
> It's a confusing and under-documented area.
>
> --b.

2022-01-04 12:36:18

by Trond Myklebust

[permalink] [raw]
Subject: Re: client caching and locks

On Tue, 2022-01-04 at 09:24 +0000, [email protected] wrote:
> Hi Neil and Bruce,
>
> Thank you for your comments.
> Now I have understood that this behavior is by design.
>
> > > With NFSv4 there is no atomicity guarantees relating to writes
> > > and
> > > changeid.
> > > There is provision for atomicity around directory operations, but
> > > not
> > > around data operations.
>
> So I feel like clients cannot always trust changeid in NFSv4.
> Should it be described in the spec?
>
> For example, the following section refers about the usage of
> changeid:
> https://datatracker.ietf.org/doc/html/draft-dnoveck-nfsv4-rfc5661bis#section-14.3.1
>
> It says clients use change attribute " to ensure that the data for
> the OPENed file is still
> correctly reflected in the client's cache." But in fact, it could be
> wrong.
>
> Yuki Inoguchi

The existence of buggy servers is not a problem for the client to deal
with. It's a problem for the server vendors to fix.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-01-04 15:32:08

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Tue, Jan 04, 2022 at 12:36:14PM +0000, Trond Myklebust wrote:
> On Tue, 2022-01-04 at 09:24 +0000, [email protected] wrote:
> > Hi Neil and Bruce,
> >
> > Thank you for your comments.
> > Now I have understood that this behavior is by design.
> >
> > > > With NFSv4 there is no atomicity guarantees relating to writes
> > > > and
> > > > changeid.
> > > > There is provision for atomicity around directory operations, but
> > > > not
> > > > around data operations.
> >
> > So I feel like clients cannot always trust changeid in NFSv4.
> > Should it be described in the spec?
> >
> > For example, the following section refers about the usage of
> > changeid:
> > https://datatracker.ietf.org/doc/html/draft-dnoveck-nfsv4-rfc5661bis#section-14.3.1
> >
> > It says clients use change attribute " to ensure that the data for
> > the OPENed file is still
> > correctly reflected in the client's cache." But in fact, it could be
> > wrong.
>
> The existence of buggy servers is not a problem for the client to deal
> with. It's a problem for the server vendors to fix.

I agree that, in the absence of bugs, there's no problem with using the
change attribute to provide close-to-open cache consistency.

The interesting question to me is how you use locks to provide cache
consistency.

Language like that in
https://datatracker.ietf.org/doc/html/rfc7530#section-10.3.2 implies
that you can get something like local cache consistency by write-locking
the ranges you write, read-locking the ranges you read, flushing before
write unlocks, and checking change attributes before read locks.

In fact, that doesn't guarantee that readers see previous writes.

--b.

2022-01-04 15:56:37

by Trond Myklebust

[permalink] [raw]
Subject: Re: client caching and locks

On Tue, 2022-01-04 at 10:32 -0500, [email protected] wrote:
> On Tue, Jan 04, 2022 at 12:36:14PM +0000, Trond Myklebust wrote:
> > On Tue, 2022-01-04 at 09:24 +0000, [email protected] wrote:
> > > Hi Neil and Bruce,
> > >
> > > Thank you for your comments.
> > > Now I have understood that this behavior is by design.
> > >
> > > > > With NFSv4 there is no atomicity guarantees relating to
> > > > > writes
> > > > > and
> > > > > changeid.
> > > > > There is provision for atomicity around directory operations,
> > > > > but
> > > > > not
> > > > > around data operations.
> > >
> > > So I feel like clients cannot always trust changeid in NFSv4.
> > > Should it be described in the spec?
> > >
> > > For example, the following section refers about the usage of
> > > changeid:
> > > https://datatracker.ietf.org/doc/html/draft-dnoveck-nfsv4-rfc5661bis#section-14.3.1
> > >
> > > It says clients use change attribute " to ensure that the data
> > > for
> > > the OPENed file is still
> > > correctly reflected in the client's cache." But in fact, it could
> > > be
> > > wrong.
> >
> > The existence of buggy servers is not a problem for the client to
> > deal
> > with. It's a problem for the server vendors to fix.
>
> I agree that, in the absence of bugs, there's no problem with using
> the
> change attribute to provide close-to-open cache consistency.
>
> The interesting question to me is how you use locks to provide cache
> consistency.
>
> Language like that in
> https://datatracker.ietf.org/doc/html/rfc7530#section-10.3.2 implies
> that you can get something like local cache consistency by write-
> locking
> the ranges you write, read-locking the ranges you read, flushing
> before
> write unlocks, and checking change attributes before read locks.
>
> In fact, that doesn't guarantee that readers see previous writes.
>

It does if you are doing full file locking. I agree it will not, if you
are doing partial file locking.

IOW: If two clients can potentially both write to different parts of
the same file at the same time, then the change attribute is
insufficient to determine whether or not that was indeed what happened.

However if only one client can write to the file at any time, then the
change attribute check should be sufficient, given a NFSv4 spec
compliant server.
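
(For reference: a whole-file lock is just a byte-range lock covering the
entire file, e.g. with fcntl(2), assuming fd is an open descriptor and
<fcntl.h> is included:)

struct flock fl = {
	.l_type   = F_WRLCK,	/* or F_RDLCK for readers */
	.l_whence = SEEK_SET,
	.l_start  = 0,
	.l_len    = 0,		/* 0 == lock to the end of the file */
};
fcntl(fd, F_SETLKW, &fl);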

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-01-05 09:33:15

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

> It does if you are doing full file locking. I agree it will not, if you
> are doing partial file locking.
>
> IOW: If two clients can potentially both write to different parts of
> the same file at the same time, then that the change attribute is
> insufficient to determine whether or not that was indeed what happened.

I understand now. So for cache consistency, full file locking is needed if
multiple clients can write to different parts of the same file concurrently.

I think this kind of information should be documented somewhere.
If it is enough to focus on file locking, I assume it belongs under the "DATA AND METADATA COHERENCE"
section in the nfs man page.

Or maybe it is better to update section 10.3.2 in RFC 7530 with more information, such as the cases
in which the change attribute can and cannot guarantee cache consistency.

Could you please tell me your opinion?

Yuki Inoguchi

> -----Original Message-----
> From: Trond Myklebust <[email protected]>
> Sent: Wednesday, January 5, 2022 12:55 AM
> To: [email protected]
> Cc: Inoguchi, Yuki/井ノ口 雄生 <[email protected]>;
> [email protected]; [email protected]; [email protected]
> Subject: Re: client caching and locks
>
> On Tue, 2022-01-04 at 10:32 -0500, [email protected] wrote:
> > On Tue, Jan 04, 2022 at 12:36:14PM +0000, Trond Myklebust wrote:
> > > On Tue, 2022-01-04 at 09:24 +0000, [email protected] wrote:
> > > > Hi Neil and Bruce,
> > > >
> > > > Thank you for your comments.
> > > > Now I have understood that this behavior is by design.
> > > >
> > > > > > With NFSv4 there is no atomicity guarantees relating to
> > > > > > writes
> > > > > > and
> > > > > > changeid.
> > > > > > There is provision for atomicity around directory operations,
> > > > > > but
> > > > > > not
> > > > > > around data operations.
> > > >
> > > > So I feel like clients cannot always trust changeid in NFSv4.
> > > > Should it be described in the spec?
> > > >
> > > > For example, the following section refers about the usage of
> > > > changeid:
> > > >
> https://datatracker.ietf.org/doc/html/draft-dnoveck-nfsv4-rfc5661bis#section
> -14.3.1
> > > >
> > > > It says clients use change attribute " to ensure that the data
> > > > for
> > > > the OPENed file is still
> > > > correctly reflected in the client's cache." But in fact, it could
> > > > be
> > > > wrong.
> > >
> > > The existence of buggy servers is not a problem for the client to
> > > deal
> > > with. It's a problem for the server vendors to fix.
> >
> > I agree that, in the absence of bugs, there's no problem with using
> > the
> > change attribute to provide close-to-open cache consistency.
> >
> > The interesting question to me is how you use locks to provide cache
> > consistency.
> >
> > Language like that in
> > https://datatracker.ietf.org/doc/html/rfc7530#section-10.3.2 implies
> > that you can get something like local cache consistency by write-
> > locking
> > the ranges you write, read-locking the ranges you read, flushing
> > before
> > write unlocks, and checking change attributes before read locks.
> >
> > In fact, that doesn't guarantee that readers see previous writes.
> >
>
> It does if you are doing full file locking. I agree it will not, if you
> are doing partial file locking.
>
> IOW: If two clients can potentially both write to different parts of
> the same file at the same time, then that the change attribute is
> insufficient to determine whether or not that was indeed what happened.
>
> However if only one client can write to the file at any time, then the
> change attribute check should be sufficient, given a NFSv4 spec
> compliant server.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>

2022-01-05 22:04:02

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Wed, Jan 05, 2022 at 09:31:59AM +0000, [email protected] wrote:
> I have understood. So for cache consistency, full file locking is needed if
> multiple clients can write the different parts of the same file concurrently.
>
> I think this kind of information should be documented in somewhere.
> If it is enough to focus on the file locking, I'm assuming it to be under "DATA AND METADATA COHERENCE"
> section in the nfs man page.

That subsection is kind of outdated. It leads with a discussion of the
(increasingly less relevant) NLM and NSM protocols, and despite being a
subsection of the "DATA AND METADATA COHERENCE" section, never gets
around to talking about that.

It also makes it sound like "nolock" only affects NLM, which I don't
think is right.

How about this? I also updated the lock/nolock description and deleted
the end of this subsection since it's redundant with that. And removed
the bit about using nolock to mount /var with v2/v3 as that seems like a
bit of a niche case at this point. If we still want to document that, I
think it belongs elsewhere.

--b.

diff --git a/utils/mount/nfs.man b/utils/mount/nfs.man
index 3bc18e1b30a9..7db043202fcf 100644
--- a/utils/mount/nfs.man
+++ b/utils/mount/nfs.man
@@ -722,10 +722,10 @@ reports the proper maximum component length to applications
in such cases.
.TP 1.5i
.BR lock " / " nolock
-Selects whether to use the NLM sideband protocol to lock files on the server.
+Selects whether to lock files on the server.
If neither option is specified (or if
.B lock
-is specified), NLM locking is used for this mount point.
+is specified), locks are taken on the server.
When using the
.B nolock
option, applications can lock files,
@@ -733,16 +733,9 @@ but such locks provide exclusion only against other applications
running on the same client.
Remote applications are not affected by these locks.
.IP
-NLM locking must be disabled with the
-.B nolock
-option when using NFS to mount
-.I /var
-because
-.I /var
-contains files used by the NLM implementation on Linux.
-Using the
+The
.B nolock
-option is also required when mounting exports on NFS servers
+option is required when using NFSv2 or NFSv3 to mount servers
that do not support the NLM protocol.
.TP 1.5i
.BR cto " / " nocto
@@ -1486,47 +1479,40 @@ the use of the
.B sync
mount option.
.SS "Using file locks with NFS"
-The Network Lock Manager protocol is a separate sideband protocol
-used to manage file locks in NFS version 2 and version 3.
-To support lock recovery after a client or server reboot,
-a second sideband protocol --
-known as the Network Status Manager protocol --
-is also required.
-In NFS version 4,
-file locking is supported directly in the main NFS protocol,
-and the NLM and NSM sideband protocols are not used.
+The nfs filesystem supports advisory byte-range locks acquired with
+.BR fcntl (2) .
+Locks obtained by
+.BR flock (2)
+are implemented as
+.BR fcntl (2)
+locks.
.P
-In most cases, NLM and NSM services are started automatically,
-and no extra configuration is required.
-Configure all NFS clients with fully-qualified domain names
-to ensure that NFS servers can find clients to notify them of server reboots.
+Locking can also provide cache consistency:
.P
-NLM supports advisory file locks only.
-To lock NFS files, use
-.BR fcntl (2)
-with the F_GETLK and F_SETLK commands.
-The NFS client converts file locks obtained via
-.BR flock (2)
-to advisory locks.
+Before acquiring a file lock, the client revalidates its cached data for
+the file. Before releasing a write lock, the client flushes to the
+server's stable storage any data in the locked range.
.P
-When mounting servers that do not support the NLM protocol,
-or when mounting an NFS server through a firewall
-that blocks the NLM service port,
-specify the
-.B nolock
-mount option. NLM locking must be disabled with the
-.B nolock
-option when using NFS to mount
-.I /var
-because
-.I /var
-contains files used by the NLM implementation on Linux.
+A distributed application running on multiple NFS clients can take a
+read lock for each range that it reads and a write lock for each range that
+it writes. On its own, however, that is insufficient to ensure that
+reads get up-to-date data.
.P
-Specifying the
-.B nolock
-option may also be advised to improve the performance
-of a proprietary application which runs on a single client
-and uses file locks extensively.
+When revalidating caches, the client is unable to reliably determine the
+difference between changes made by other clients and changes it made
+itself. Therefore, such an application would also need to prevent
+concurrent writes from multiple clients, either by taking whole-file
+locks on every write or by some other method.
+.P
+The protocol used for file locking differs between versions. In versions
+before NFSv4, locks are implemented using the Network Lock Manager and
+Network Status Manager protocols. In versions since NFSv4, file locking
+is supported directly in the main NFS protocol.
+.P
+In most cases, NLM and NSM services are started automatically,
+and no extra configuration is required. NFSv2 and NFSv3 clients should
+be configured with fully-qualified domain names
+to ensure that NFS servers can find clients to notify them of server reboots.
.SS "NFS version 4 caching features"
The data and metadata caching behavior of NFS version 4
clients is similar to that of earlier versions.

2022-01-06 07:31:59

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

> How about this? I also updated the lock/nolock description and deleted
> the end of this subsection since it's redundant with that. And removed
> the bit about using nolock to mount /var with v2/v3 as that seems like a
> bit of a niche case at this point. If we still want to document that, I
> think it belongs elsewhere.

Thank you so much for creating the patch!

For the most part I agree with you, but I feel uneasy about removing the part about
"using nolock to mount /var with v2/v3" even if it seems like a niche case.
I'm also not sure there is another suitable document to move it to.

Therefore, at the end of "lock/nolock" subsection ...

> @@ -733,16 +733,9 @@ but such locks provide exclusion only against other
> applications
> running on the same client.
> Remote applications are not affected by these locks.
> .IP
> -NLM locking must be disabled with the
> -.B nolock
> -option when using NFS to mount
> -.I /var
> -because
> -.I /var
> -contains files used by the NLM implementation on Linux.
> -Using the
> +The
> .B nolock
> -option is also required when mounting exports on NFS servers
> +option is required when using NFSv2 or NFSv3 to mount servers
> that do not support the NLM protocol.
> .TP 1.5i
> .BR cto " / " nocto

... can we keep the description like this ?
-----
@@ -733,17 +733,14 @@ but such locks provide exclusion only against other applications
running on the same client.
Remote applications are not affected by these locks.
.IP
-NLM locking must be disabled with the
+When using NFSv2 or NFSv3, the
.B nolock
-option when using NFS to mount
-.I /var
-because
+option is required to mount servers that do not support the NLM protocol,
+or to mount
.I /var
+because
+.I /var
contains files used by the NLM implementation on Linux.
-Using the
-.B nolock
-option is also required when mounting exports on NFS servers
-that do not support the NLM protocol.
.TP 1.5i
.BR cto " / " nocto
Selects whether to use close-to-open cache coherence semantics.
-----
Yuki Inoguchi

2022-01-06 14:16:30

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Thu, Jan 06, 2022 at 07:23:16AM +0000, [email protected] wrote:
> > How about this? I also updated the lock/nolock description and deleted
> > the end of this subsection since it's redundant with that. And removed
> > the bit about using nolock to mount /var with v2/v3 as that seems like a
> > bit of a niche case at this point. If we still want to document that, I
> > think it belongs elsewhere.
>
> Thank you so much for creating the patch!
>
> For the most part I agree with you, but I feel unsafe to remove the part
> "using nolock to mount /var with v2/v3" even if it seems niche case.
> I'm also not sure if there is another suitable document to migrate it.
>
> Therefore, at the end of "lock/nolock" subsection ...

I could live with that.

Though the other reason I cut it was because I think it needs updates
too and I wasn't sure exactly how to handle them.

The v4 case is more important and should probably be dealt with first.
I think the answer there is just "don't mount /var over NFSv4", period.

And maybe we should be more specific: the problem is with /var/lib/nfs,
not all of /var.

--b.

>
> > @@ -733,16 +733,9 @@ but such locks provide exclusion only against other
> > applications
> > running on the same client.
> > Remote applications are not affected by these locks.
> > .IP
> > -NLM locking must be disabled with the
> > -.B nolock
> > -option when using NFS to mount
> > -.I /var
> > -because
> > -.I /var
> > -contains files used by the NLM implementation on Linux.
> > -Using the
> > +The
> > .B nolock
> > -option is also required when mounting exports on NFS servers
> > +option is required when using NFSv2 or NFSv3 to mount servers
> > that do not support the NLM protocol.
> > .TP 1.5i
> > .BR cto " / " nocto
>
> ... can we keep the description like this ?
> -----
> @@ -733,17 +733,14 @@ but such locks provide exclusion only against other applications
> running on the same client.
> Remote applications are not affected by these locks.
> .IP
> -NLM locking must be disabled with the
> +When using NFSv2 or NFSv3, the
> .B nolock
> -option when using NFS to mount
> -.I /var
> -because
> +option is required to mount servers that do not support the NLM protocol,
> +or to mount
> .I /var
> +because
> +.I /var
> contains files used by the NLM implementation on Linux.
> -Using the
> -.B nolock
> -option is also required when mounting exports on NFS servers
> -that do not support the NLM protocol.
> .TP 1.5i
> .BR cto " / " nocto
> Selects whether to use close-to-open cache coherence semantics.
> -----
> Yuki Inoguchi

2022-01-07 08:34:54

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

> I could live with that.
>
> Though the other reason I cut it was because I think it needs updates
> too and I wasn't sure exactly how to handle them.
>
> The v4 case is more important and should probably be dealt with first.
> I think the answer there is just "don't mount /var over NFSv4", period.
>
> And maybe we should be more specific: the problem is with /var/lib/nfs,
> not all of /var.

I see it now. Yeah, I agree. If we keep that description, we need to consider the v4 case ...

After reviewing what will happen if we delete this description for /var,
I've changed my mind. I believe it won't cause a serious problem.
I have concluded that it is no longer needed for the following reasons:

1. Not many users build new systems with NFSv2 or v3 nowadays.
2. Even with the nolock option, mounting /var is risky. For example,
if /var is mounted while the system is running, existing processes would
be prevented from accessing the files they require.

In other words, I agree with your idea. Your first patch looks good to me.
I think we can remove the part "using nolock to mount /var with v2/v3" from the nfs man page.

I apologize for taking your time.

Yuki Inoguchi

2022-01-09 21:59:03

by NeilBrown

[permalink] [raw]
Subject: Re: client caching and locks

On Thu, 06 Jan 2022, '[email protected]' wrote:

> +Locking can also provide cache consistency:
> .P
> -NLM supports advisory file locks only.
> -To lock NFS files, use
> -.BR fcntl (2)
> -with the F_GETLK and F_SETLK commands.
> -The NFS client converts file locks obtained via
> -.BR flock (2)
> -to advisory locks.
> +Before acquiring a file lock, the client revalidates its cached data for
> +the file. Before releasing a write lock, the client flushes to the
> +server's stable storage any data in the locked range.

Surely the client revalidates *after* acquiring the lock on the server.
Otherwise the revalidation has no value.

> .P
> -When mounting servers that do not support the NLM protocol,
> -or when mounting an NFS server through a firewall
> -that blocks the NLM service port,
> -specify the
> -.B nolock
> -mount option. NLM locking must be disabled with the
> -.B nolock
> -option when using NFS to mount
> -.I /var
> -because
> -.I /var
> -contains files used by the NLM implementation on Linux.
> +A distributed application running on multiple NFS clients can take a
> +read lock for each range that it reads and a write lock for each range that
> +it writes. On its own, however, that is insufficient to ensure that
> +reads get up-to-date data.
> .P
> -Specifying the
> -.B nolock
> -option may also be advised to improve the performance
> -of a proprietary application which runs on a single client
> -and uses file locks extensively.
> +When revalidating caches, the client is unable to reliably determine the
> +difference between changes made by other clients and changes it made
> +itself. Therefore, such an application would also need to prevent
> +concurrent writes from multiple clients, either by taking whole-file
> +locks on every write or by some other method.

This looks like it is documenting a bug - I would much rather the bug be
fixed.

If a client opens/reads/closes a file while no other client has the file
open, then it *must* return current data. Currently (according to
reports) it does not reliably do this.

If a write from this client races with a write from another client
(whether or not locking is used), the fact that fetching the change attr
is not atomic w.r.t. I/O means that the client *cannot* trust any cached
data after it has closed a file that it wrote to - unless it had a
delegation.
Hmm.. that sounds a bit convoluted.

1/ If a client opens a file for write but does not get a delegation, and
then writes to the file, then when it closes the file it *must*
invalidate any cached data as there could have been a concurrent
write from another client which is not visible in the changeid
information. CTO consistency rules allow the client to keep cached
data up to the close.
2/ If a client opens a file for write and *does* get a delegation, then
providing it gets a changeid from the server after final write and
before returning the delegation, it can keep all cached data (until
the server reports a new changeid).

Note that the inability to cache in '1' *should* *not* be a performance
problem in practice.
a/ if locking is used, cached data is not trusted anyway, so no loss
b/ if locking is not used, then no concurrency is expected, so
delegations are to be expected, so case '1' doesn't apply.
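
Expressed as a sketch in code (illustrative only - the helper is
hypothetical and this is not an actual client code path, though
nfs_set_cache_invalid() and NFS_INO_INVALID_DATA are real identifiers):

/*
 * Sketch: what happens to cached file data when a file that was
 * written is closed, per cases 1 and 2 above.
 */
static void sketch_post_close_cache(struct inode *inode, bool wrote,
				    bool had_write_delegation)
{
	if (!wrote)
		return;		/* nothing written: CTO rules unchanged */

	if (had_write_delegation)
		return;		/* case 2: cache stays valid until the
				 * server reports a new changeid */

	/* case 1: a concurrent write from another client may be hidden
	 * in the changeid information, so cached data must be dropped. */
	nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
}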

NeilBrown

2022-01-09 22:16:33

by NeilBrown

[permalink] [raw]
Subject: Re: client caching and locks

On Fri, 07 Jan 2022, '[email protected]' wrote:
> On Thu, Jan 06, 2022 at 07:23:16AM +0000, [email protected] wrote:
> > > How about this? I also updated the lock/nolock description and deleted
> > > the end of this subsection since it's redundant with that. And removed
> > > the bit about using nolock to mount /var with v2/v3 as that seems like a
> > > bit of a niche case at this point. If we still want to document that, I
> > > think it belongs elsewhere.
> >
> > Thank you so much for creating the patch!
> >
> > For the most part I agree with you, but I feel unsafe to remove the part
> > "using nolock to mount /var with v2/v3" even if it seems niche case.
> > I'm also not sure if there is another suitable document to migrate it.
> >
> > Therefore, at the end of "lock/nolock" subsection ...
>
> I could live with that.
>
> Though the other reason I cut it was because I think it needs updates
> too and I wasn't sure exactly how to handle them.
>
> The v4 case is more important and should probably be dealt with first.
> I think the answer there is just "don't mount /var over NFSv4", period.

Why not? An NFSv4 client has no business accessing anything in
/var/lib/nfs, so how can it cause a problem?

>
> And maybe we should be more specific: the problem is with /var/lib/nfs,
> not all of /var.

According to "3.2. CLIENT STARTUP" section "C/" in the README that comes
with nfs-utils,
statd should be run before any NFSv2 or NFSv3 filesystem is
mounted with remote locking (i.e. without -o nolock).

so it is only fair to say "the problem is with /var/lib/nfs" if that is
a separate filesystem from /var.

Note that "-o nolock" should also be used for the root filesystem for
root-over-NFSv3, as that must be mounted before statd can be running.

Maybe that is how it should be explained in the man page:

-o nolock should be used to mount any NFSv3 filesystems that are
mounted before rpc.statd can be started, which can only be started
after /var/lib/nfs is available.
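
For example (an illustrative fstab entry, not taken from the man page),
an NFSv3 filesystem that has to be mounted before rpc.statd can start
might look like:

    server:/export/var   /var   nfs   vers=3,nolock   0 0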

NeilBrown

2022-01-09 22:38:09

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Mon, Jan 10, 2022 at 09:16:21AM +1100, NeilBrown wrote:
> On Fri, 07 Jan 2022, '[email protected]' wrote:
> > On Thu, Jan 06, 2022 at 07:23:16AM +0000, [email protected] wrote:
> > > > How about this? I also updated the lock/nolock description and deleted
> > > > the end of this subsection since it's redundant with that. And removed
> > > > the bit about using nolock to mount /var with v2/v3 as that seems like a
> > > > bit of a niche case at this point. If we still want to document that, I
> > > > think it belongs elsewhere.
> > >
> > > Thank you so much for creating the patch!
> > >
> > > For the most part I agree with you, but I feel unsafe to remove the part
> > > "using nolock to mount /var with v2/v3" even if it seems niche case.
> > > I'm also not sure if there is another suitable document to migrate it.
> > >
> > > Therefore, at the end of "lock/nolock" subsection ...
> >
> > I could live with that.
> >
> > Though the other reason I cut it was because I think it needs updates
> > too and I wasn't sure exactly how to handle them.
> >
> > The v4 case is more important and should probably be dealt with first.
> > I think the answer there is just "don't mount /var over NFSv4", period.
>
> Why not? An NFSv4 client has no business accessing anything in
> /var/lib/nfs, so how can it cause a problem?

Oops, you're right.

> > And maybe we should be more specific: the problem is with /var/lib/nfs,
> > not all of /var.
>
> According to "3.2. CLIENT STARTUP" section "C/" in the README that comes
> with nfs-utils,
> statd should be run before any NFSv2 or NFSv3 filesystem is
> mounted with remote locking (i.e. without -o nolock).
>
> so it is only fair to say "the problem is with /var/lib/nfs" if that is
> a separate filesystem from /var.
>
> Note that "-o nolock" should also be used for the root filesystem for
> root-over-NFSv3, as that must be mounted before statd can be running.
>
> Maybe that is how it should be explained in the man page:
>
> -o nolock should be used to mount any NFSv3 filesystems that are
> mounted before rpc.statd can be started, which can only be started
> after /var/lib/nfs is available.

Thanks. I fiddled around some more and ended up with the following. --b.

@@ -737,6 +737,10 @@ The
.B nolock
option is required when using NFSv2 or NFSv3 to mount servers
that do not support the NLM protocol.
+Also, /var/lib/nfs must be available before rpc.statd can be started,
+and NFSv3 locking requires rpc.statd, so if any NFSv3 filesystems are
+needed to reach /var/lib/nfs, they must be mounted with
+.B nolock .
.TP 1.5i
.BR cto " / " nocto
Selects whether to use close-to-open cache coherence semantics.


2022-01-09 22:41:26

by J. Bruce Fields

[permalink] [raw]
Subject: Re: client caching and locks

On Mon, Jan 10, 2022 at 08:58:55AM +1100, NeilBrown wrote:
> On Thu, 06 Jan 2022, '[email protected]' wrote:
>
> > +Locking can also provide cache consistency:
> > .P
> > -NLM supports advisory file locks only.
> > -To lock NFS files, use
> > -.BR fcntl (2)
> > -with the F_GETLK and F_SETLK commands.
> > -The NFS client converts file locks obtained via
> > -.BR flock (2)
> > -to advisory locks.
> > +Before acquiring a file lock, the client revalidates its cached data for
> > +the file. Before releasing a write lock, the client flushes to the
> > +server's stable storage any data in the locked range.
>
> Surely the client revalidates *after* acquiring the lock on the server.
> Otherwise the revalidation has no value.

Gah.

@@ -1489,9 +1493,9 @@ locks.
.P
Locking can also provide cache consistency:
.P
-Before acquiring a file lock, the client revalidates its cached data for
-the file. Before releasing a write lock, the client flushes to the
-server's stable storage any data in the locked range.
+After acquiring a file lock and before using any cached data, the client
+revalidates its cache. Before releasing a write lock, the client flushes to
+the server's stable storage any data in the locked range.
.P
A distributed application running on multiple NFS clients can take a
read lock for each range that it reads and a write lock for each range that
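
As an application-side illustration of that paragraph (a sketch, not
part of the man page; error handling is trimmed and the offsets are
arbitrary), the pattern looks like:

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Write-lock a byte range, write it, then unlock.  Taking the lock
 * makes the client revalidate its cache; dropping the write lock makes
 * it flush the locked range to the server. */
static int write_locked_range(int fd, off_t off, size_t len, const void *buf)
{
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = off,
		.l_len    = len,
	};

	if (fcntl(fd, F_SETLKW, &fl) == -1)
		return -1;

	ssize_t n = pwrite(fd, buf, len, off);

	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);

	return n == (ssize_t)len ? 0 : -1;
}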

--b.

2022-01-17 17:03:43

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

> 1/ If a client opens a file for write but does not get a delegation, and
> then writes to the file, then when it closes the file it *must*
> invalidate any cached data as there could have been a concurrent
> write from another client which is not visible in the changeid
> information. CTO consistency rules allow the client to keep cached
> data up to the close.
> 2/ If a client opens a file for write and *does* get a delegation, then
> providing it gets a changeid from the server after final write and
> before returning the delegation, it can keep all cached data (until
> the server reports a new changeid).
>
> Note that the inability to cache in '1' *should* *not* be a performance
> problem in practice.
> a/ if locking is used, cached data is not trusted anyway, so no loss

What about the case of using a whole-file lock?
I'm assuming that cached data is trusted in that case, so it could be a performance problem, couldn't it?

Yuki Inoguchi

2022-01-18 03:03:30

by NeilBrown

[permalink] [raw]
Subject: RE: client caching and locks

On Mon, 17 Jan 2022, [email protected] wrote:
> > 1/ If a client opens a file for write but does not get a delegation, and
> > then writes to the file, then when it closes the file it *must*
> > invalidate any cached data as there could have been a concurrent
> > write from another client which is not visible in the changeid
> > information. CTO consistency rules allow the client to keep cached
> > data up to the close.
> > 2/ If a client opens a file for write and *does* get a delegation, then
> > providing it gets a changeid from the server after final write and
> > before returning the delegation, it can keep all cached data (until
> > the server reports a new changeid).
> >
> > Note that the inability to cache in '1' *should* *not* be a performance
> > problem in practice.
> > a/ if locking is used, cached data is not trusted anyway, so no loss
>
> What about the case of using a whole-file lock?
> I'm assuming that cached data is trusted in that case, so it could be a performance problem, couldn't it?

I don't think that case adds anything interesting. When the file is
closed, the lock is dropped. If there were any writes without a
delegation, then the changeid isn't a reliable indication that no other
client wrote. So the cache must be dropped.

NeilBrown


>
> Yuki Inoguchi
>

2022-02-03 18:14:43

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

> I don't think that case adds anything interesting. When the file is
> closed, the lock is dropped. If there were any writes without a
> delegation, then the changeid isn't a reliable indication that no other
> client wrote. So the cache must be dropped.

I understand it now.

I've made a patch based on your idea. It invalidates the cache
after a client without a write delegation sends the CLOSE call.
My Open MPI test program confirmed that this fix resolves the problem.

The patch is as follows. What do you think?
-----
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index b18f31b2c9e7..67021786034d 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -3711,7 +3711,8 @@ static const struct rpc_call_ops nfs4_close_ops = {
  */
 int nfs4_do_close(struct nfs4_state *state, gfp_t gfp_mask, int wait)
 {
-	struct nfs_server *server = NFS_SERVER(state->inode);
+	struct inode *inode = state->inode;
+	struct nfs_server *server = NFS_SERVER(inode);
 	struct nfs_seqid *(*alloc_seqid)(struct nfs_seqid_counter *, gfp_t);
 	struct nfs4_closedata *calldata;
 	struct nfs4_state_owner *sp = state->owner;
@@ -3773,6 +3774,15 @@ int nfs4_do_close(struct nfs4_state *state, gfp_t gfp_mask, int wait)
 	status = 0;
 	if (wait)
 		status = rpc_wait_for_completion_task(task);
+
+	if (status >= 0 && !NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE)) {
+		nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR
+				| NFS_INO_INVALID_DATA
+				| NFS_INO_INVALID_ACCESS
+				| NFS_INO_INVALID_ACL
+				| NFS_INO_REVAL_PAGECACHE);
+	}
+
 	rpc_put_task(task);
 	return status;
 out_free_calldata:
-----

Yuki Inoguchi

2022-02-03 20:43:07

by [email protected]

[permalink] [raw]
Subject: RE: client caching and locks

Thank you for the review.

> I would make 2 changes.
> 1/ invalidate when opening a file, rather than when closing.
> This fits better with the understanding of close-to-open consistency
> that we flush writes when closing and verify the cache when opening
> 2/ I would be more careful about determining exactly when the
> invalidation might be needed.

Yes, I'm willing to make these changes.

> In nfs_post_op_update_inode() there is the code:
>
> if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
> (fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) {
> fattr->pre_change_attr = inode_peek_iversion_raw(inode);
> fattr->valid |= NFS_ATTR_FATTR_PRECHANGE;
> }

You mean nfs_post_op_update_inode_force_wcc_locked(), not nfs_post_op_update_inode(), right?
This is just to make sure --- so I can set the new flag in the appropriate place :)

> I assume that code doesn't end up running when you write to a file for
> which you have a delegation, but I'm not at all certain - we would have
> to check.

Maybe it is nfs_check_inode_attributes()?
It returns without doing anything if you have a delegation.
It is called from:
nfs_post_op_update_inode_force_wcc_locked()
-> nfs_post_op_update_inode_locked()
-> nfs_refresh_inode_locked()
-> nfs_check_inode_attributes()

1476 static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fattr)
1477 {
1478 struct nfs_inode *nfsi = NFS_I(inode);
1479 loff_t cur_size, new_isize;
1480 unsigned long invalid = 0;
1481 struct timespec64 ts;
1482
1483 if (NFS_PROTO(inode)->have_delegation(inode, FMODE_READ))
1484 return 0;

Yuki Inoguchi

2022-02-04 15:18:53

by NeilBrown

[permalink] [raw]
Subject: RE: client caching and locks

On Wed, 02 Feb 2022, [email protected] wrote:
> > I don't think that case adds anything interesting. When the file is
> > closed, the lock is dropped. If there were any writes without a
> > delegation, then the changeid isn't a reliable indication that no other
> > client wrote. So the cache must be dropped.
>
> I've understood it.
>
> I've made the patch based on your idea. It invalidates the cache
> after a client without write-delegation sends CLOSE call.
> My Open MPI test program confirmed that this fix resolves the problem.
>
> The patch is as follows. What do you think?

This is a bit too heavy-handed. It invalidates too often.

I would make 2 changes.
1/ invalidate when opening a file, rather than when closing.
This fits better with the understanding of close-to-open consistency
that we flush writes when closing and verify the cache when opening
2/ I would be more careful about determining exactly when the
invalidation might be needed.

In nfs_post_op_update_inode() there is the code:

	if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
	    (fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) {
		fattr->pre_change_attr = inode_peek_iversion_raw(inode);
		fattr->valid |= NFS_ATTR_FATTR_PRECHANGE;
	}

This effectively says "if we don't have the PRECHANGE number, then take
the current version number and pretend that we do have the PRECHANGE
number".

NFSv3 can provide a PRECHANGE number because the protocol allows
pre/post attrs to be atomic. NFSv4 never sets PRECHANGE because the
protocol never guarantees atomicity of pre/post attrs (for files).

So this is the point in the code where the cache on the client might
become inconsistent with the data on the server. This is not a serious
inconsistency and doesn't need to be resolved immediately. But I think
it should be handled the next time some application opens the file.

I think that when this branch of code is run, we should set a flag on
the inode "NFS_ATTR_MIGHT_BE_INCONSISTENT" (or something like that).
Then when we open a file, if that flag is set then we clear it and set
NFS_INO_INVALID_DATA and maybe NFS_INO_REVAL_PAGECACHE (I don't recall how
those two relate to each other).

I assume that code doesn't end up running when you write to a file for
which you have a delegation, but I'm not at all certain - we would have
to check.
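
A rough sketch of that suggestion (the flag name is hypothetical, as
noted above, and the placement of the open-time check is deliberately
left vague):

	/* In the branch quoted above, where a "fake" pre_change_attr is
	 * synthesized, also record that the cache may be stale: */
	if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
	    (fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) {
		fattr->pre_change_attr = inode_peek_iversion_raw(inode);
		fattr->valid |= NFS_ATTR_FATTR_PRECHANGE;
		set_bit(NFS_INO_MIGHT_BE_INCONSISTENT, &NFS_I(inode)->flags);
	}

	/* Then, when the file is next opened: */
	if (test_and_clear_bit(NFS_INO_MIGHT_BE_INCONSISTENT,
			       &NFS_I(inode)->flags))
		nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
					     NFS_INO_REVAL_PAGECACHE);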

NeilBrown

2022-02-09 09:59:24

by NeilBrown

[permalink] [raw]
Subject: RE: client caching and locks

On Thu, 03 Feb 2022, [email protected] wrote:
> Thank you for the review.
>
> > I would make 2 changes.
> > 1/ invalidate when opening a file, rather than when closing.
> > This fits better with the understanding of close-to-open consistency
> > that we flush writes when closing and verify the cache when opening
> > 2/ I would be more careful about determining exactly when the
> > invalidation might be needed.
>
> Yes, I'm willing to make these changes.
>
> > In nfs_post_op_update_inode() there is the code:
> >
> > if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
> > (fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) {
> > fattr->pre_change_attr = inode_peek_iversion_raw(inode);
> > fattr->valid |= NFS_ATTR_FATTR_PRECHANGE;
> > }
>
> You mean nfs_post_op_update_inode_force_wcc_locked(), not nfs_post_op_update_inode(), right?
> This is just to make sure --- so I can set the new flag in the appropriate place :)

Yes, you are correct.

>
> > I assume that code doesn't end up running when you write to a file for
> > which you have a delegation, but I'm not at all certain - we would have
> > to check.
>
> Maybe it is nfs_check_inode_attributes()?
> It returns without doing anything if you have a delegation.
> It is called from:
> nfs_post_op_update_inode_force_wcc_locked()
> -> nfs_post_op_update_inode_locked()
> -> nfs_refresh_inode_locked()
> -> nfs_check_inode_attributes()
>
> 1476 static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fattr)
> 1477 {
> 1478 struct nfs_inode *nfsi = NFS_I(inode);
> 1479 loff_t cur_size, new_isize;
> 1480 unsigned long invalid = 0;
> 1481 struct timespec64 ts;
> 1482
> 1483 if (NFS_PROTO(inode)->have_delegation(inode, FMODE_READ))
> 1484 return 0;
>

I don't *think* that is relevant. If you set the new flag where I
suggested, that code will already have run before it gets to the
have_delegation test.

nfs4_write_need_cache_consistency_data() seems relevant.
If that returns false (which it does when there is a delegation) the
WRITE request doesn't even ask for attributes.
In that case if nfs_post_op_update_inode_force_wcc_locked() is called,
it will find that NFS_ATTR_FATTR_CHANGE is not set, so it won't try to
set ->pre_change_attr with a "fake" value, and so won't set the new flag.

NeilBrown