Return-Path: Received: from mail-oi0-f46.google.com ([209.85.218.46]:36081 "EHLO mail-oi0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751780AbbHFVkF (ORCPT ); Thu, 6 Aug 2015 17:40:05 -0400 Received: by oiev193 with SMTP id v193so16158596oie.3 for ; Thu, 06 Aug 2015 14:40:04 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <913A1219-4719-4E0E-ACF5-A6F7C79215C8@oracle.com> References: <20150805182305.11796.75476.stgit@manet.1015granger.net> <359CCE70-6056-418F-B627-688BE27C7EA8@oracle.com> <8F6C1893-11A1-485C-A9EA-0ABBCC4D4286@oracle.com> <913A1219-4719-4E0E-ACF5-A6F7C79215C8@oracle.com> Date: Thu, 6 Aug 2015 17:40:04 -0400 Message-ID: Subject: Re: [PATCH] NFS: Add OTW write barrier before may-open test From: Trond Myklebust To: Chuck Lever Cc: Linux NFS Mailing List Content-Type: text/plain; charset=UTF-8 Sender: linux-nfs-owner@vger.kernel.org List-ID: On Thu, Aug 6, 2015 at 12:56 PM, Chuck Lever wrote: > > > On Aug 5, 2015, at 10:50 PM, Trond Myklebust wrote: > > > On Wed, Aug 5, 2015 at 8:15 PM, Chuck Lever wrote: > >> > >> On Aug 5, 2015, at 6:27 PM, Trond Myklebust wrote: > >> > >>> On Wed, Aug 5, 2015 at 2:44 PM, Chuck Lever wrote: > >>>> > >>>> On Aug 5, 2015, at 2:34 PM, Chuck Lever wrote: > >>>> > >>>>> Commit 14546c337588 ("NFS: Don't do a full flush to disk on close() > >>>>> if we hold a delegation") added an optimization. When an NFSv4 write > >>>>> delegation is present, close(2) does not wait while a file's dirty > >>>>> data is flushed to the NFS server. > >>>>> > >>>>> However, if the application workload immediately re-opens that file, > >>>>> nfs_may_open() can perform an ACCESS and GETATTR which runs > >>>>> concurrently with the flushing WRITE. If the flushing WRITE and > >>>>> GETATTR complete out of order on the server, the file size cached on > >>>>> the client will go backwards, possibly resulting in new writes going > >>>>> to the wrong file offset. > >>>>> > >>>>> Add a write barrier before the access check to ensure the server's > >>>>> idea of the file's size is properly up to date. > >>>>> > >>>>> The downside of this approach is that each fresh open(2) of a dirty > >>>>> file results in an extra flush. It seems to me that _any_ open(2) > >>>>> done while there is dirty data waiting on the client could result in > >>>>> a file size roll back. However, I see bad behavior only when the > >>>>> client holds a write delegation. > >>>>> > >>>>> Fixes: 14546c337588 ("NFS: Don't do a full flush to disk on . . .") > >>>>> Signed-off-by: Chuck Lever > >>>>> --- > >>>>> fs/nfs/dir.c | 9 +++++++++ > >>>>> 1 file changed, 9 insertions(+) > >>>>> > >>>>> I'm not certain this is a good long term fix. Some other possible > >>>>> solutions include: > >>>>> > >>>>> - Not performing the access check if the client holds a delegation > >>>>> - Not performing a GETATTR as part of the ACCESS check > >>>>> - Simply marking the file attributes stale instead of using the > >>>>> returned file size > >>>>> - Reverting commit 14546c337588 > >>>> > >>>> OK. If the client holds a write delegation, then it shouldn't care > >>>> about the server's file size at all until it has flushed all dirty > >>>> data and returned the delegation. So flushing here is probably wrong. > >>>> > >>>> But the incoming file size in the GETATTR is definitely screwing up > >>>> the cached file size. > >>>> > >>> > >>> In which kernels are you seeing the race? For recent kernels (v4.0+) > >>> the write code should be calling nfs_fattr_set_barrier() in order to > >>> prevent the result from the ACCESS from overwriting the new file size. > >> > >> I'm testing on 4.2-rc4. > >> > > > > OK. Does turning off the check for nfs_ctime_need_update() in > > nfs_inode_attrs_need_update() fix the problem? > > I am not able to reproduce the issue when I remove > nfs_ctime_need_update(). > > The test I'm running does this: unpack git onto an NFSv4.0 > mount point, build it, then run "make -j8 test". > > Doing "make -j1 test" always works. The test always works > against Linux NFS servers. > > With Solaris, which aggressively offers write delegations, > the test fails reliably after -j3, at random points during > the "make test" step. > > A network trace was captured during a failure, which > happened to occur during t6030-bisect-porcelain.sh this > time through. > > Here's an excerpt which shows the problem. The failing > script seems to be: > > 98 echo " master" > branch.expect && > 99 echo "* other" >> branch.expect && > > Frame 29091: SETATTR call set size to 0 > Frame 29095: SETATTR reply > > Frame 29097: WRITE call offset 0, " master\n" > Frame 29097: ACCESS/GETATTR call > > [NB: the ACCESS call appears after the WRITE call in the > same TCP frame] > > Frame 29101: ACCESS/GETATTR reply size 0 <<<< > Frame 29102: WRITE reply 9 bytes written > > Frame 29112: WRITE call offset 0, "* other\n" > Frame 29115: WRITE reply 8 bytes written > > The successful case shows the ACCESS and first WRITE replies > in the reverse order; the ACCESS reply shows the file size > is 9; the second WRITE then sends the correct data, which is > " master\n* other\n" . > > Since nfs_may_open() is called only if the client is holding > a delegation, I wonder if that GETATTR is needed? The only > thing may_open is checking is whether the caller is allowed > to use the existing delegation. Shouldn't the GETATTR results > be ignored in every case? We could do that, but I can't really see how this can be particular to delegations. We can always hit this kind of race if a GETATTR and a WRITE happen to race, and the WRITE happens not to return an updated ctime. I can, for instance, see this happening both with NFSv3 servers that don't return post-op attributes, as well as with O_DIRECT and pNFS writes. IOW: I think it is time to attempt to jettison the ctime crutch and attempt to rely only on the attribute barriers. Note also that ctime isn't even listed as a NFSv4 mandatory attribute and so we may eventually hit other cases anyway. I'll run some tests to see if I can catch that generating any new races...