Return-Path: Received: from mx1.redhat.com ([209.132.183.28]:41776 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751962AbcGYSYw (ORCPT ); Mon, 25 Jul 2016 14:24:52 -0400 From: "Benjamin Coddington" To: "Trond Myklebust" Cc: "hch@infradead.org" , "List Linux NFS Mailing" Subject: Re: [PATCH v4 24/28] Getattr doesn't require data sync semantics Date: Mon, 25 Jul 2016 14:26:33 -0400 Message-ID: In-Reply-To: References: <1467844205-76852-23-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-24-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-25-git-send-email-trond.myklebust@primarydata.com> <20160718034847.GA1195@infradead.org> <1468817945.5273.2.camel@primarydata.com> <20160719035843.GA24437@infradead.org> <20160721082216.GA15247@infradead.org> <9E2543EF-0442-491E-AB22-3E1667DABBA4@redhat.com> <70375D08-FCA9-4C73-9991-38490C11B171@redhat.com> <8B8BB73D-39A6-4291-A10E-736B0F907459@primarydata.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Sender: linux-nfs-owner@vger.kernel.org List-ID: On 25 Jul 2016, at 12:39, Trond Myklebust wrote: >> On Jul 25, 2016, at 12:26, Benjamin Coddington >> wrote: >> >> On 21 Jul 2016, at 9:20, Trond Myklebust wrote: >> >>>> On Jul 21, 2016, at 09:05, Benjamin Coddington >>>> wrote: >>>> >>>> So back to Christoph's point earlier: >>>> >>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote: >>>>> This one breaks xfstests generic/207 on block/scsi layout for me. >>>>> The >>>>> reason for that is that we need a layoutcommit after writing out >>>>> all >>>>> data for the file for the file size to be updated on the server. >>>> >>>> You responded: >>>> >>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote: >>>>> I’m not understanding this argument. Why do we care if the file >>>>> size is up >>>>> to date on the server if we’re not sending an actual GETATTR on >>>>> the wire >>>>> to retrieve the file size? >>>> >>>> I guess the answer might be because we can get it back from the >>>> last >>>> LAYOUTCOMMIT. >>>> >>> >>> The patch that I followed up with should now ensure that we do not >>> mark the attribute cache as up to date if there is a LAYOUTCOMMIT >>> pending. >>> IOW: when the pNFS write is done, it is expected to do 2 things: >>> >>> 1) mark the inode for LAYOUTCOMMIT >>> 2) mark the attribute cache as invalid (because we know the change >>> attribute, mtime, ctime need to be updates) >>> >>> In the case of blocks pNFS write: >>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should >>> take care of (1) >>> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() >>> should take care of (2). >>> >>> Provided that these 2 calls are performed in the above order, then >>> any call to nfs_getattr() which has not been preceded by a call to >>> nfs4_proc_layoutcommit() should trigger the call to >>> __nfs_revalidate_inode(). >> >> I think the problem is that a following nfs_getattr() will fail to >> notice >> the size change in the case of a write_completion and layoutcommit >> occuring >> after nfs_getattr() has done pnfs_sync_inode() but before it has done >> nfs_update_inode(). >> >> In the failing case there are two threads one is doing writes, the >> other >> doing lstat on aio_complete via io_getevents(2). >> >> For each write completion the lstat thread tries to verify the file >> size. >> >> GETATTR Thread LAYOUTCOMMIT Thread >> -------------- -------------------- >> write_completion sets LAYOUTCOMMIT >> (4096@0) >> --> nfs_getattr > > filemap_write_and_wait() > >> __nfs_revalidate_inode >> pnfs_sync_inode > > NFS_PROTO(inode)->getattr() > >> getattr sees 4096 >> write_completion sets LAYOUTCOMMIT >> (4096@4096) >> sets LAYOUTCOMMITING >> clears LAYOUTCOMMIT >> clears LAYOUTCOMMITTING >> nfs_refresh_inode >> nfs_update_inode size is 4096 >> <-- nfs_getattr >> >> At this point the cached attributes are seen as up to date, but >> aio-dio-extend-stat program expects that second write_completion to >> reflect >> in the file size. >> > > Why isn’t the filemap_write_and_wait() above resolving the race? > I’d > expect that would move your “write completion sets LAYOUTCOMMIT” > up to > before the pnfs_sync_inode(). In fact, in the patch that Christoph > sent, > all he was doing was moving the pnfs_sync_inode() to immediately after > that filemap_write_and_wait() instead of relying on it in > _nfs_revalidate_inode. This is O_DIRECT, I've failed to mention yet. The second write hasn't made it out of __nfs_pageio_add_request() at the time filemap_write_and_wait() is called. It is sleeping in pnfs_update_layout() waiting on a LAYOUTGET and it doesn't resumes until after filemap_write_and_wait().