2015-08-05 18:34:52

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH] NFS: Add OTW write barrier before may-open test

Commit 14546c337588 ("NFS: Don't do a full flush to disk on close()
if we hold a delegation") added an optimization. When an NFSv4 write
delegation is present, close(2) does not wait while a file's dirty
data is flushed to the NFS server.

However, if the application workload immediately re-opens that file,
nfs_may_open() can perform an ACCESS and GETATTR which runs
concurrently with the flushing WRITE. If the flushing WRITE and
GETATTR complete out of order on the server, the file size cached on
the client will go backwards, possibly resulting in new writes going
to the wrong file offset.

Add a write barrier before the access check to ensure the server's
idea of the file's size is properly up to date.

The downside of this approach is that each fresh open(2) of a dirty
file results in an extra flush. It seems to me that _any_ open(2)
done while there is dirty data waiting on the client could result in
a file size roll back. However, I see bad behavior only when the
client holds a write delegation.

Fixes: 14546c337588 ("NFS: Don't do a full flush to disk on . . .")
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfs/dir.c | 9 +++++++++
1 file changed, 9 insertions(+)

I'm not certain this is a good long term fix. Some other possible
solutions include:

- Not performing the access check if the client holds a delegation
- Not performing a GETATTR as part of the ACCESS check
- Simply marking the file attributes stale instead of using the
returned file size
- Reverting commit 14546c337588

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 547308a..7b36993 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2439,6 +2439,15 @@ static int nfs_open_permission_mask(int openflags)

int nfs_may_open(struct inode *inode, struct rpc_cred *cred, int openflags)
{
+ /*
+ * If a write delegation is present, close(2) does not wait after
+ * flushing dirty data. Wait for write completion here to ensure
+ * the file size on the server is up to date. Otherwise this
+ * access check will roll back the cached file size.
+ */
+ if (NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
+ nfs_sync_inode(inode);
+
return nfs_do_access(inode, cred, nfs_open_permission_mask(openflags));
}
EXPORT_SYMBOL_GPL(nfs_may_open);



2015-08-05 18:44:39

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH] NFS: Add OTW write barrier before may-open test


On Aug 5, 2015, at 2:34 PM, Chuck Lever <[email protected]> wrote:

> Commit 14546c337588 ("NFS: Don't do a full flush to disk on close()
> if we hold a delegation") added an optimization. When an NFSv4 write
> delegation is present, close(2) does not wait while a file's dirty
> data is flushed to the NFS server.
>
> However, if the application workload immediately re-opens that file,
> nfs_may_open() can perform an ACCESS and GETATTR which runs
> concurrently with the flushing WRITE. If the flushing WRITE and
> GETATTR complete out of order on the server, the file size cached on
> the client will go backwards, possibly resulting in new writes going
> to the wrong file offset.
>
> Add a write barrier before the access check to ensure the server's
> idea of the file's size is properly up to date.
>
> The downside of this approach is that each fresh open(2) of a dirty
> file results in an extra flush. It seems to me that _any_ open(2)
> done while there is dirty data waiting on the client could result in
> a file size roll back. However, I see bad behavior only when the
> client holds a write delegation.
>
> Fixes: 14546c337588 ("NFS: Don't do a full flush to disk on . . .")
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> fs/nfs/dir.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> I'm not certain this is a good long term fix. Some other possible
> solutions include:
>
> - Not performing the access check if the client holds a delegation
> - Not performing a GETATTR as part of the ACCESS check
> - Simply marking the file attributes stale instead of using the
> returned file size
> - Reverting commit 14546c337588

OK. If the client holds a write delegation, then it shouldn't care
about the server's file size at all until it has flushed all dirty
data and returned the delegation. So flushing here is probably wrong.

But the incoming file size in the GETATTR is definitely screwing up
the cached file size.


> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 547308a..7b36993 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -2439,6 +2439,15 @@ static int nfs_open_permission_mask(int openflags)
>
> int nfs_may_open(struct inode *inode, struct rpc_cred *cred, int openflags)
> {
> + /*
> + * If a write delegation is present, close(2) does not wait after
> + * flushing dirty data. Wait for write completion here to ensure
> + * the file size on the server is up to date. Otherwise this
> + * access check will roll back the cached file size.
> + */
> + if (NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
> + nfs_sync_inode(inode);
> +
> return nfs_do_access(inode, cred, nfs_open_permission_mask(openflags));
> }
> EXPORT_SYMBOL_GPL(nfs_may_open);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




2015-08-05 22:27:54

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] NFS: Add OTW write barrier before may-open test

On Wed, Aug 5, 2015 at 2:44 PM, Chuck Lever <[email protected]> wrote:
>
> On Aug 5, 2015, at 2:34 PM, Chuck Lever <[email protected]> wrote:
>
>> Commit 14546c337588 ("NFS: Don't do a full flush to disk on close()
>> if we hold a delegation") added an optimization. When an NFSv4 write
>> delegation is present, close(2) does not wait while a file's dirty
>> data is flushed to the NFS server.
>>
>> However, if the application workload immediately re-opens that file,
>> nfs_may_open() can perform an ACCESS and GETATTR which runs
>> concurrently with the flushing WRITE. If the flushing WRITE and
>> GETATTR complete out of order on the server, the file size cached on
>> the client will go backwards, possibly resulting in new writes going
>> to the wrong file offset.
>>
>> Add a write barrier before the access check to ensure the server's
>> idea of the file's size is properly up to date.
>>
>> The downside of this approach is that each fresh open(2) of a dirty
>> file results in an extra flush. It seems to me that _any_ open(2)
>> done while there is dirty data waiting on the client could result in
>> a file size roll back. However, I see bad behavior only when the
>> client holds a write delegation.
>>
>> Fixes: 14546c337588 ("NFS: Don't do a full flush to disk on . . .")
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> fs/nfs/dir.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> I'm not certain this is a good long term fix. Some other possible
>> solutions include:
>>
>> - Not performing the access check if the client holds a delegation
>> - Not performing a GETATTR as part of the ACCESS check
>> - Simply marking the file attributes stale instead of using the
>> returned file size
>> - Reverting commit 14546c337588
>
> OK. If the client holds a write delegation, then it shouldn't care
> about the server's file size at all until it has flushed all dirty
> data and returned the delegation. So flushing here is probably wrong.
>
> But the incoming file size in the GETATTR is definitely screwing up
> the cached file size.
>

In which kernels are you seeing the race? For recent kernels (v4.0+)
the write code should be calling nfs_fattr_set_barrier() in order to
prevent the result from the ACCESS from overwriting the new file size.

2015-08-06 00:15:40

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH] NFS: Add OTW write barrier before may-open test


On Aug 5, 2015, at 6:27 PM, Trond Myklebust <[email protected]> wrote:

> On Wed, Aug 5, 2015 at 2:44 PM, Chuck Lever <[email protected]> wrote:
>>
>> On Aug 5, 2015, at 2:34 PM, Chuck Lever <[email protected]> wrote:
>>
>>> Commit 14546c337588 ("NFS: Don't do a full flush to disk on close()
>>> if we hold a delegation") added an optimization. When an NFSv4 write
>>> delegation is present, close(2) does not wait while a file's dirty
>>> data is flushed to the NFS server.
>>>
>>> However, if the application workload immediately re-opens that file,
>>> nfs_may_open() can perform an ACCESS and GETATTR which runs
>>> concurrently with the flushing WRITE. If the flushing WRITE and
>>> GETATTR complete out of order on the server, the file size cached on
>>> the client will go backwards, possibly resulting in new writes going
>>> to the wrong file offset.
>>>
>>> Add a write barrier before the access check to ensure the server's
>>> idea of the file's size is properly up to date.
>>>
>>> The downside of this approach is that each fresh open(2) of a dirty
>>> file results in an extra flush. It seems to me that _any_ open(2)
>>> done while there is dirty data waiting on the client could result in
>>> a file size roll back. However, I see bad behavior only when the
>>> client holds a write delegation.
>>>
>>> Fixes: 14546c337588 ("NFS: Don't do a full flush to disk on . . .")
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> fs/nfs/dir.c | 9 +++++++++
>>> 1 file changed, 9 insertions(+)
>>>
>>> I'm not certain this is a good long term fix. Some other possible
>>> solutions include:
>>>
>>> - Not performing the access check if the client holds a delegation
>>> - Not performing a GETATTR as part of the ACCESS check
>>> - Simply marking the file attributes stale instead of using the
>>> returned file size
>>> - Reverting commit 14546c337588
>>
>> OK. If the client holds a write delegation, then it shouldn't care
>> about the server's file size at all until it has flushed all dirty
>> data and returned the delegation. So flushing here is probably wrong.
>>
>> But the incoming file size in the GETATTR is definitely screwing up
>> the cached file size.
>>
>
> In which kernels are you seeing the race? For recent kernels (v4.0+)
> the write code should be calling nfs_fattr_set_barrier() in order to
> prevent the result from the ACCESS from overwriting the new file size.

I'm testing on 4.2-rc4.


--
Chuck Lever




2015-08-06 02:50:22

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] NFS: Add OTW write barrier before may-open test

On Wed, Aug 5, 2015 at 8:15 PM, Chuck Lever <[email protected]> wrote:
>
> On Aug 5, 2015, at 6:27 PM, Trond Myklebust <[email protected]> wrote:
>
>> On Wed, Aug 5, 2015 at 2:44 PM, Chuck Lever <[email protected]> wrote:
>>>
>>> On Aug 5, 2015, at 2:34 PM, Chuck Lever <[email protected]> wrote:
>>>
>>>> Commit 14546c337588 ("NFS: Don't do a full flush to disk on close()
>>>> if we hold a delegation") added an optimization. When an NFSv4 write
>>>> delegation is present, close(2) does not wait while a file's dirty
>>>> data is flushed to the NFS server.
>>>>
>>>> However, if the application workload immediately re-opens that file,
>>>> nfs_may_open() can perform an ACCESS and GETATTR which runs
>>>> concurrently with the flushing WRITE. If the flushing WRITE and
>>>> GETATTR complete out of order on the server, the file size cached on
>>>> the client will go backwards, possibly resulting in new writes going
>>>> to the wrong file offset.
>>>>
>>>> Add a write barrier before the access check to ensure the server's
>>>> idea of the file's size is properly up to date.
>>>>
>>>> The downside of this approach is that each fresh open(2) of a dirty
>>>> file results in an extra flush. It seems to me that _any_ open(2)
>>>> done while there is dirty data waiting on the client could result in
>>>> a file size roll back. However, I see bad behavior only when the
>>>> client holds a write delegation.
>>>>
>>>> Fixes: 14546c337588 ("NFS: Don't do a full flush to disk on . . .")
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>> fs/nfs/dir.c | 9 +++++++++
>>>> 1 file changed, 9 insertions(+)
>>>>
>>>> I'm not certain this is a good long term fix. Some other possible
>>>> solutions include:
>>>>
>>>> - Not performing the access check if the client holds a delegation
>>>> - Not performing a GETATTR as part of the ACCESS check
>>>> - Simply marking the file attributes stale instead of using the
>>>> returned file size
>>>> - Reverting commit 14546c337588
>>>
>>> OK. If the client holds a write delegation, then it shouldn't care
>>> about the server's file size at all until it has flushed all dirty
>>> data and returned the delegation. So flushing here is probably wrong.
>>>
>>> But the incoming file size in the GETATTR is definitely screwing up
>>> the cached file size.
>>>
>>
>> In which kernels are you seeing the race? For recent kernels (v4.0+)
>> the write code should be calling nfs_fattr_set_barrier() in order to
>> prevent the result from the ACCESS from overwriting the new file size.
>
> I'm testing on 4.2-rc4.
>

OK. Does turning off the check for nfs_ctime_need_update() in
nfs_inode_attrs_need_update() fix the problem?

2015-08-06 16:56:42

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH] NFS: Add OTW write barrier before may-open test


On Aug 5, 2015, at 10:50 PM, Trond Myklebust <[email protected]> wrote:

> On Wed, Aug 5, 2015 at 8:15 PM, Chuck Lever <[email protected]> wrote:
>>
>> On Aug 5, 2015, at 6:27 PM, Trond Myklebust <[email protected]> wrote:
>>
>>> On Wed, Aug 5, 2015 at 2:44 PM, Chuck Lever <[email protected]> wrote:
>>>>
>>>> On Aug 5, 2015, at 2:34 PM, Chuck Lever <[email protected]> wrote:
>>>>
>>>>> Commit 14546c337588 ("NFS: Don't do a full flush to disk on close()
>>>>> if we hold a delegation") added an optimization. When an NFSv4 write
>>>>> delegation is present, close(2) does not wait while a file's dirty
>>>>> data is flushed to the NFS server.
>>>>>
>>>>> However, if the application workload immediately re-opens that file,
>>>>> nfs_may_open() can perform an ACCESS and GETATTR which runs
>>>>> concurrently with the flushing WRITE. If the flushing WRITE and
>>>>> GETATTR complete out of order on the server, the file size cached on
>>>>> the client will go backwards, possibly resulting in new writes going
>>>>> to the wrong file offset.
>>>>>
>>>>> Add a write barrier before the access check to ensure the server's
>>>>> idea of the file's size is properly up to date.
>>>>>
>>>>> The downside of this approach is that each fresh open(2) of a dirty
>>>>> file results in an extra flush. It seems to me that _any_ open(2)
>>>>> done while there is dirty data waiting on the client could result in
>>>>> a file size roll back. However, I see bad behavior only when the
>>>>> client holds a write delegation.
>>>>>
>>>>> Fixes: 14546c337588 ("NFS: Don't do a full flush to disk on . . .")
>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>> ---
>>>>> fs/nfs/dir.c | 9 +++++++++
>>>>> 1 file changed, 9 insertions(+)
>>>>>
>>>>> I'm not certain this is a good long term fix. Some other possible
>>>>> solutions include:
>>>>>
>>>>> - Not performing the access check if the client holds a delegation
>>>>> - Not performing a GETATTR as part of the ACCESS check
>>>>> - Simply marking the file attributes stale instead of using the
>>>>> returned file size
>>>>> - Reverting commit 14546c337588
>>>>
>>>> OK. If the client holds a write delegation, then it shouldn't care
>>>> about the server's file size at all until it has flushed all dirty
>>>> data and returned the delegation. So flushing here is probably wrong.
>>>>
>>>> But the incoming file size in the GETATTR is definitely screwing up
>>>> the cached file size.
>>>>
>>>
>>> In which kernels are you seeing the race? For recent kernels (v4.0+)
>>> the write code should be calling nfs_fattr_set_barrier() in order to
>>> prevent the result from the ACCESS from overwriting the new file size.
>>
>> I'm testing on 4.2-rc4.
>>
>
> OK. Does turning off the check for nfs_ctime_need_update() in
> nfs_inode_attrs_need_update() fix the problem?

I am not able to reproduce the issue when I remove
nfs_ctime_need_update().

The test I'm running does this: unpack git onto an NFSv4.0
mount point, build it, then run "make -j8 test".

Doing "make -j1 test" always works. The test always works
against Linux NFS servers.

With Solaris, which aggressively offers write delegations,
the test fails reliably after -j3, at random points during
the "make test" step.

A network trace was captured during a failure, which
happened to occur during t6030-bisect-porcelain.sh this
time through.

Here's an excerpt which shows the problem. The failing
script seems to be:

98 echo " master" > branch.expect &&
99 echo "* other" >> branch.expect &&

Frame 29091: SETATTR call set size to 0
Frame 29095: SETATTR reply

Frame 29097: WRITE call offset 0, " master\n"
Frame 29097: ACCESS/GETATTR call

[NB: the ACCESS call appears after the WRITE call in the
same TCP frame]

Frame 29101: ACCESS/GETATTR reply size 0 <<<<
Frame 29102: WRITE reply 9 bytes written

Frame 29112: WRITE call offset 0, "* other\n"
Frame 29115: WRITE reply 8 bytes written

The successful case shows the ACCESS and first WRITE replies
in the reverse order; the ACCESS reply shows the file size
is 9; the second WRITE then sends the correct data, which is
" master\n* other\n" .

Since nfs_may_open() is called only if the client is holding
a delegation, I wonder if that GETATTR is needed? The only
thing may_open is checking is whether the caller is allowed
to use the existing delegation. Shouldn't the GETATTR results
be ignored in every case?

--
Chuck Lever




2015-08-06 21:40:05

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] NFS: Add OTW write barrier before may-open test

On Thu, Aug 6, 2015 at 12:56 PM, Chuck Lever <[email protected]> wrote:
>
>
> On Aug 5, 2015, at 10:50 PM, Trond Myklebust <[email protected]> wrote:
>
> > On Wed, Aug 5, 2015 at 8:15 PM, Chuck Lever <[email protected]> wrote:
> >>
> >> On Aug 5, 2015, at 6:27 PM, Trond Myklebust <[email protected]> wrote:
> >>
> >>> On Wed, Aug 5, 2015 at 2:44 PM, Chuck Lever <[email protected]> wrote:
> >>>>
> >>>> On Aug 5, 2015, at 2:34 PM, Chuck Lever <[email protected]> wrote:
> >>>>
> >>>>> Commit 14546c337588 ("NFS: Don't do a full flush to disk on close()
> >>>>> if we hold a delegation") added an optimization. When an NFSv4 write
> >>>>> delegation is present, close(2) does not wait while a file's dirty
> >>>>> data is flushed to the NFS server.
> >>>>>
> >>>>> However, if the application workload immediately re-opens that file,
> >>>>> nfs_may_open() can perform an ACCESS and GETATTR which runs
> >>>>> concurrently with the flushing WRITE. If the flushing WRITE and
> >>>>> GETATTR complete out of order on the server, the file size cached on
> >>>>> the client will go backwards, possibly resulting in new writes going
> >>>>> to the wrong file offset.
> >>>>>
> >>>>> Add a write barrier before the access check to ensure the server's
> >>>>> idea of the file's size is properly up to date.
> >>>>>
> >>>>> The downside of this approach is that each fresh open(2) of a dirty
> >>>>> file results in an extra flush. It seems to me that _any_ open(2)
> >>>>> done while there is dirty data waiting on the client could result in
> >>>>> a file size roll back. However, I see bad behavior only when the
> >>>>> client holds a write delegation.
> >>>>>
> >>>>> Fixes: 14546c337588 ("NFS: Don't do a full flush to disk on . . .")
> >>>>> Signed-off-by: Chuck Lever <[email protected]>
> >>>>> ---
> >>>>> fs/nfs/dir.c | 9 +++++++++
> >>>>> 1 file changed, 9 insertions(+)
> >>>>>
> >>>>> I'm not certain this is a good long term fix. Some other possible
> >>>>> solutions include:
> >>>>>
> >>>>> - Not performing the access check if the client holds a delegation
> >>>>> - Not performing a GETATTR as part of the ACCESS check
> >>>>> - Simply marking the file attributes stale instead of using the
> >>>>> returned file size
> >>>>> - Reverting commit 14546c337588
> >>>>
> >>>> OK. If the client holds a write delegation, then it shouldn't care
> >>>> about the server's file size at all until it has flushed all dirty
> >>>> data and returned the delegation. So flushing here is probably wrong.
> >>>>
> >>>> But the incoming file size in the GETATTR is definitely screwing up
> >>>> the cached file size.
> >>>>
> >>>
> >>> In which kernels are you seeing the race? For recent kernels (v4.0+)
> >>> the write code should be calling nfs_fattr_set_barrier() in order to
> >>> prevent the result from the ACCESS from overwriting the new file size.
> >>
> >> I'm testing on 4.2-rc4.
> >>
> >
> > OK. Does turning off the check for nfs_ctime_need_update() in
> > nfs_inode_attrs_need_update() fix the problem?
>
> I am not able to reproduce the issue when I remove
> nfs_ctime_need_update().
>
> The test I'm running does this: unpack git onto an NFSv4.0
> mount point, build it, then run "make -j8 test".
>
> Doing "make -j1 test" always works. The test always works
> against Linux NFS servers.
>
> With Solaris, which aggressively offers write delegations,
> the test fails reliably after -j3, at random points during
> the "make test" step.
>
> A network trace was captured during a failure, which
> happened to occur during t6030-bisect-porcelain.sh this
> time through.
>
> Here's an excerpt which shows the problem. The failing
> script seems to be:
>
> 98 echo " master" > branch.expect &&
> 99 echo "* other" >> branch.expect &&
>
> Frame 29091: SETATTR call set size to 0
> Frame 29095: SETATTR reply
>
> Frame 29097: WRITE call offset 0, " master\n"
> Frame 29097: ACCESS/GETATTR call
>
> [NB: the ACCESS call appears after the WRITE call in the
> same TCP frame]
>
> Frame 29101: ACCESS/GETATTR reply size 0 <<<<
> Frame 29102: WRITE reply 9 bytes written
>
> Frame 29112: WRITE call offset 0, "* other\n"
> Frame 29115: WRITE reply 8 bytes written
>
> The successful case shows the ACCESS and first WRITE replies
> in the reverse order; the ACCESS reply shows the file size
> is 9; the second WRITE then sends the correct data, which is
> " master\n* other\n" .
>
> Since nfs_may_open() is called only if the client is holding
> a delegation, I wonder if that GETATTR is needed? The only
> thing may_open is checking is whether the caller is allowed
> to use the existing delegation. Shouldn't the GETATTR results
> be ignored in every case?

We could do that, but I can't really see how this can be particular to
delegations. We can always hit this kind of race if a GETATTR and a
WRITE happen to race, and the WRITE happens not to return an updated
ctime. I can, for instance, see this happening both with NFSv3 servers
that don't return post-op attributes, as well as with O_DIRECT and
pNFS writes.

IOW: I think it is time to attempt to jettison the ctime crutch and
attempt to rely only on the attribute barriers. Note also that ctime
isn't even listed as a NFSv4 mandatory attribute and so we may
eventually hit other cases anyway.

I'll run some tests to see if I can catch that generating any new races...