Return-Path: linux-nfs-owner@vger.kernel.org Received: from aserp1040.oracle.com ([141.146.126.69]:20921 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932176AbbBZPIq convert rfc822-to-8bit (ORCPT ); Thu, 26 Feb 2015 10:08:46 -0500 Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: File Read Returns Non-existent Null Bytes From: Chuck Lever In-Reply-To: <1424914057.41161.8.camel@primarydata.com> Date: Thu, 26 Feb 2015 10:08:34 -0500 Cc: Chris Perl , Linux NFS Mailing List , Chris Perl Message-Id: <6DC0A7F3-F647-494F-9554-2C78067F5A39@oracle.com> References: <1424911067.41161.2.camel@primarydata.com> <1424911390.41161.4.camel@primarydata.com> <1424914057.41161.8.camel@primarydata.com> To: Trond Myklebust Sender: linux-nfs-owner@vger.kernel.org List-ID: On Feb 25, 2015, at 8:27 PM, Trond Myklebust wrote: > On Wed, 2015-02-25 at 19:43 -0500, Trond Myklebust wrote: >> On Wed, 2015-02-25 at 19:37 -0500, Trond Myklebust wrote: >>> On Wed, 2015-02-25 at 17:32 -0500, Chuck Lever wrote: >>>> FWIW it?s easy to reproduce a similar race with fsx, and I encounter >>>> it frequently while running xfstests on fast NFS servers. >>>> >>>> fsx invokes ftruncate following a set of asynchronous reads >>>> (generated possibly due to readahead). The reads are started first, >>>> then the SETATTR, but they complete out of order. >>>> >>>> The SETATTR changes the test file?s size, and the completion >>>> updates the file size in the client?s inode. Then the read requests >>>> complete on the client and set the file?s size back to its old value. >>>> >>>> All it takes is one late read completion, and the cached file size >>>> is corrupted. fsx detects the file size mismatch and terminates the >>>> test. The file size is corrected by a subsequent GETATTR (say, an >>>> ?ls -l? to check it after fsx has terminated). >>>> >>>> While SETATTR blocks concurrent writes, there?s no serialization >>>> on either the client or server to help guarantee the ordering of >>>> SETATTR with read operations. >>>> >>>> I?ve found a successful workaround by forcing the client to ignore >>>> post-op attrs in read replies. A stronger solution might simply set >>>> the ?file attributes need update? flag in the inode if any file >>>> attribute mutation is noticed during a read completion. >>> >>> That's different. We definitely should aim to fix this kind of issue >>> since you are talking about a single client being the only thing >>> accessing the file on the server. >>> How about the following patch? >> >> Let me retry without the typos. The following will actually >> compile... :-/ > > Third version: this contains a minor optimisation so that we don't force > the re-read of the last page in the file. Also a little more detail in > the changelog. Thanks, I?ll give this a shot, hopefully today. > 8<-------------------------------------------------------------------- > From d2c822d64a68542665e4887b4d7bccd6e8e3a741 Mon Sep 17 00:00:00 2001 > From: Trond Myklebust > Date: Wed, 25 Feb 2015 19:26:28 -0500 > Subject: [PATCH] NFS: Quiesce reads before updating the file size after > truncating > > Chuck Lever reports seeing readaheads racing with truncate operations > and causing the file size to be reverted. Fix is to quiesce reads > after truncating the file on the server, but before updating the > size on the client. > > Note: > > 1) We believe that waiting for those reads that lie outside the new > file boundaries is sufficient, since this means any attempt to read > the data in those areas will trigger a new RPC call, which will be > ordered w.r.t. the size change on the server. > > 2) We do not move the call to truncate_pagecache(), since we want > to ensure that any partial page zeroing is ordered w.r.t. the > update of inode->i_size. > > Reported-by: Chuck Lever > Signed-off-by: Trond Myklebust > --- > fs/nfs/inode.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c > index 83107be3dd01..daf3b1e183fe 100644 > --- a/fs/nfs/inode.c > +++ b/fs/nfs/inode.c > @@ -565,6 +565,11 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset) > if (err) > goto out; > > + /* Quiesce reads that lie outside the new file size */ > + invalidate_inode_pages2_range(inode->i_mapping, > + (offset + PAGE_CACHE_SHIFT - 1) >> PAGE_CACHE_SHIFT, > + -1); > + > spin_lock(&inode->i_lock); > i_size_write(inode, offset); > /* Optimisation */ > -- > 2.1.0 > > > -- Chuck Lever chuck[dot]lever[at]oracle[dot]com