Return-Path: linux-nfs-owner@vger.kernel.org Received: from mail-qg0-f48.google.com ([209.85.192.48]:50165 "EHLO mail-qg0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752748AbbBZB1l (ORCPT ); Wed, 25 Feb 2015 20:27:41 -0500 Received: by mail-qg0-f48.google.com with SMTP id l89so173588qgf.7 for ; Wed, 25 Feb 2015 17:27:40 -0800 (PST) Message-ID: <1424914057.41161.8.camel@primarydata.com> Subject: Re: File Read Returns Non-existent Null Bytes From: Trond Myklebust To: Chuck Lever Cc: Chris Perl , Linux NFS Mailing List , Chris Perl Date: Wed, 25 Feb 2015 20:27:37 -0500 In-Reply-To: <1424911390.41161.4.camel@primarydata.com> References: <1424911067.41161.2.camel@primarydata.com> <1424911390.41161.4.camel@primarydata.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Sender: linux-nfs-owner@vger.kernel.org List-ID: On Wed, 2015-02-25 at 19:43 -0500, Trond Myklebust wrote: > On Wed, 2015-02-25 at 19:37 -0500, Trond Myklebust wrote: > > On Wed, 2015-02-25 at 17:32 -0500, Chuck Lever wrote: > > > FWIW it’s easy to reproduce a similar race with fsx, and I encounter > > > it frequently while running xfstests on fast NFS servers. > > > > > > fsx invokes ftruncate following a set of asynchronous reads > > > (generated possibly due to readahead). The reads are started first, > > > then the SETATTR, but they complete out of order. > > > > > > The SETATTR changes the test file’s size, and the completion > > > updates the file size in the client’s inode. Then the read requests > > > complete on the client and set the file’s size back to its old value. > > > > > > All it takes is one late read completion, and the cached file size > > > is corrupted. fsx detects the file size mismatch and terminates the > > > test. The file size is corrected by a subsequent GETATTR (say, an > > > “ls -l” to check it after fsx has terminated). > > > > > > While SETATTR blocks concurrent writes, there’s no serialization > > > on either the client or server to help guarantee the ordering of > > > SETATTR with read operations. > > > > > > I’ve found a successful workaround by forcing the client to ignore > > > post-op attrs in read replies. A stronger solution might simply set > > > the “file attributes need update” flag in the inode if any file > > > attribute mutation is noticed during a read completion. > > > > That's different. We definitely should aim to fix this kind of issue > > since you are talking about a single client being the only thing > > accessing the file on the server. > > How about the following patch? > > Let me retry without the typos. The following will actually > compile... :-/ Third version: this contains a minor optimisation so that we don't force the re-read of the last page in the file. Also a little more detail in the changelog. 8<-------------------------------------------------------------------- >From d2c822d64a68542665e4887b4d7bccd6e8e3a741 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Wed, 25 Feb 2015 19:26:28 -0500 Subject: [PATCH] NFS: Quiesce reads before updating the file size after truncating Chuck Lever reports seeing readaheads racing with truncate operations and causing the file size to be reverted. Fix is to quiesce reads after truncating the file on the server, but before updating the size on the client. Note: 1) We believe that waiting for those reads that lie outside the new file boundaries is sufficient, since this means any attempt to read the data in those areas will trigger a new RPC call, which will be ordered w.r.t. the size change on the server. 2) We do not move the call to truncate_pagecache(), since we want to ensure that any partial page zeroing is ordered w.r.t. the update of inode->i_size. Reported-by: Chuck Lever Signed-off-by: Trond Myklebust --- fs/nfs/inode.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 83107be3dd01..daf3b1e183fe 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -565,6 +565,11 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset) if (err) goto out; + /* Quiesce reads that lie outside the new file size */ + invalidate_inode_pages2_range(inode->i_mapping, + (offset + PAGE_CACHE_SHIFT - 1) >> PAGE_CACHE_SHIFT, + -1); + spin_lock(&inode->i_lock); i_size_write(inode, offset); /* Optimisation */ -- 2.1.0