Return-Path: Received: from mx1.redhat.com ([209.132.183.28]:34617 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754595AbcGZR4A convert rfc822-to-8bit (ORCPT ); Tue, 26 Jul 2016 13:56:00 -0400 From: "Benjamin Coddington" To: "Trond Myklebust" Cc: "hch@infradead.org" , "List Linux NFS Mailing" Subject: Re: [PATCH v4 24/28] Getattr doesn't require data sync semantics Date: Tue, 26 Jul 2016 13:57:40 -0400 Message-ID: <4CA4361B-2501-48AA-8B2C-715E5A3FD35C@redhat.com> In-Reply-To: <3448E968-3564-43C1-ADBB-A16C9687B0F3@primarydata.com> References: <1467844205-76852-23-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-24-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-25-git-send-email-trond.myklebust@primarydata.com> <20160718034847.GA1195@infradead.org> <1468817945.5273.2.camel@primarydata.com> <20160719035843.GA24437@infradead.org> <20160721082216.GA15247@infradead.org> <9E2543EF-0442-491E-AB22-3E1667DABBA4@redhat.com> <70375D08-FCA9-4C73-9991-38490C11B171@redhat.com> <8B8BB73D-39A6-4291-A10E-736B0F907459@primarydata.com> <737F0071-AAD2-49B5-9BD0-02DF98049B76@primarydata.com> <10749CC8-F76E-4DBD-9C27-9802475CF8A5@redhat.com> <3448E968-3564-43C1-ADBB-A16C9687B0F3@primarydata.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Sender: linux-nfs-owner@vger.kernel.org List-ID: On 26 Jul 2016, at 12:35, Trond Myklebust wrote: >> On Jul 26, 2016, at 12:32, Benjamin Coddington >> wrote: >> >> On 25 Jul 2016, at 14:41, Benjamin Coddington wrote: >> >>> On 25 Jul 2016, at 14:34, Trond Myklebust wrote: >>> >>>>> On Jul 25, 2016, at 14:26, Benjamin Coddington >>>>> wrote: >>>>> >>>>> >>>>> >>>>> On 25 Jul 2016, at 12:39, Trond Myklebust wrote: >>>>> >>>>>>> On Jul 25, 2016, at 12:26, Benjamin Coddington >>>>>>> wrote: >>>>>>> >>>>>>> On 21 Jul 2016, at 9:20, Trond Myklebust wrote: >>>>>>> >>>>>>>>> On Jul 21, 2016, at 09:05, Benjamin Coddington >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> So back to Christoph's point earlier: >>>>>>>>> >>>>>>>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote: >>>>>>>>>> This one breaks xfstests generic/207 on block/scsi layout for >>>>>>>>>> me. The >>>>>>>>>> reason for that is that we need a layoutcommit after writing >>>>>>>>>> out all >>>>>>>>>> data for the file for the file size to be updated on the >>>>>>>>>> server. >>>>>>>>> >>>>>>>>> You responded: >>>>>>>>> >>>>>>>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote: >>>>>>>>>> I’m not understanding this argument. Why do we care if the >>>>>>>>>> file size is up >>>>>>>>>> to date on the server if we’re not sending an actual >>>>>>>>>> GETATTR on the wire >>>>>>>>>> to retrieve the file size? >>>>>>>>> >>>>>>>>> I guess the answer might be because we can get it back from >>>>>>>>> the last >>>>>>>>> LAYOUTCOMMIT. >>>>>>>>> >>>>>>>> >>>>>>>> The patch that I followed up with should now ensure that we do >>>>>>>> not mark the attribute cache as up to date if there is a >>>>>>>> LAYOUTCOMMIT pending. >>>>>>>> IOW: when the pNFS write is done, it is expected to do 2 >>>>>>>> things: >>>>>>>> >>>>>>>> 1) mark the inode for LAYOUTCOMMIT >>>>>>>> 2) mark the attribute cache as invalid (because we know the >>>>>>>> change attribute, mtime, ctime need to be updates) >>>>>>>> >>>>>>>> In the case of blocks pNFS write: >>>>>>>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() >>>>>>>> should take care of (1) >>>>>>>> The call to nfs_writeback_update_inode() in >>>>>>>> nfs4_write_done_cb() should take care of (2). >>>>>>>> >>>>>>>> Provided that these 2 calls are performed in the above order, >>>>>>>> then any call to nfs_getattr() which has not been preceded by a >>>>>>>> call to nfs4_proc_layoutcommit() should trigger the call to >>>>>>>> __nfs_revalidate_inode(). >>>>>>> >>>>>>> I think the problem is that a following nfs_getattr() will fail >>>>>>> to notice >>>>>>> the size change in the case of a write_completion and >>>>>>> layoutcommit occuring >>>>>>> after nfs_getattr() has done pnfs_sync_inode() but before it has >>>>>>> done >>>>>>> nfs_update_inode(). >>>>>>> >>>>>>> In the failing case there are two threads one is doing writes, >>>>>>> the other >>>>>>> doing lstat on aio_complete via io_getevents(2). >>>>>>> >>>>>>> For each write completion the lstat thread tries to verify the >>>>>>> file size. >>>>>>> >>>>>>> GETATTR Thread LAYOUTCOMMIT Thread >>>>>>> -------------- -------------------- >>>>>>> write_completion sets LAYOUTCOMMIT >>>>>>> (4096@0) >>>>>>> --> nfs_getattr >>>>>> >>>>>> filemap_write_and_wait() >>>>>> >>>>>>> __nfs_revalidate_inode >>>>>>> pnfs_sync_inode >>>>>> >>>>>> NFS_PROTO(inode)->getattr() >>>>>> >>>>>>> getattr sees 4096 >>>>>>> write_completion sets LAYOUTCOMMIT >>>>>>> (4096@4096) >>>>>>> sets LAYOUTCOMMITING >>>>>>> clears LAYOUTCOMMIT >>>>>>> clears LAYOUTCOMMITTING >>>>>>> nfs_refresh_inode >>>>>>> nfs_update_inode size is 4096 >>>>>>> <-- nfs_getattr >>>>>>> >>>>>>> At this point the cached attributes are seen as up to date, but >>>>>>> aio-dio-extend-stat program expects that second write_completion >>>>>>> to reflect >>>>>>> in the file size. >>>>>>> >>>>>> >>>>>> Why isn’t the filemap_write_and_wait() above resolving the >>>>>> race? I’d >>>>>> expect that would move your “write completion sets >>>>>> LAYOUTCOMMIT” up to >>>>>> before the pnfs_sync_inode(). In fact, in the patch that >>>>>> Christoph sent, >>>>>> all he was doing was moving the pnfs_sync_inode() to immediately >>>>>> after >>>>>> that filemap_write_and_wait() instead of relying on it in >>>>>> _nfs_revalidate_inode. >>>>> >>>>> This is O_DIRECT, I've failed to mention yet. The second write >>>>> hasn't made >>>>> it out of __nfs_pageio_add_request() at the time >>>>> filemap_write_and_wait() is >>>>> called. It is sleeping in pnfs_update_layout() waiting on a >>>>> LAYOUTGET and it >>>>> doesn't resumes until after filemap_write_and_wait(). >>>> >>>> Wait, so you have 1 thread doing an O_DIRECT write() and another >>>> doing a >>>> stat() in parallel? Why would there be an expectation that the >>>> filesystem >>>> should serialise those system calls? >>> >>> Not exactly parallel, but synchronized on aio_complete. A comment >>> in >>> generic/207's src/aio-dio-regress/aio-dio-extend-stat.c: >>> >>> 36 /* >>> 37 * This was originally submitted to >>> 38 * http://bugzilla.kernel.org/show_bug.cgi?id=6831 by >>> 39 * Rafal Wijata . It caught a race in dio >>> aio completion >>> 40 * that would call aio_complete() before the dio callers would >>> update i_size. >>> 41 * A stat after io_getevents() would not see the new file size. >>> 42 * >>> 43 * The bug was fixed in the fs/direct-io.c completion reworking >>> that appeared >>> 44 * in 2.6.20. This test should fail on 2.6.19. >>> 45 */ >>> >>> As far as I can see, this check is the whole point of generic/207.. >> >> This would fix it up: >> >> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c >> index f108d58101f8..823700f827b6 100644 >> --- a/fs/nfs/inode.c >> +++ b/fs/nfs/inode.c >> @@ -661,6 +661,7 @@ int nfs_getattr(struct vfsmount *mnt, struct >> dentry >> *dentry, struct kstat *stat) >> >> trace_nfs_getattr_enter(inode); >> /* Flush out writes to the server in order to update c/mtime. >> */ >> + nfs_start_io_read(inode); >> if (S_ISREG(inode->i_mode)) { >> err = filemap_write_and_wait(inode->i_mapping); >> if (err) >> @@ -694,6 +695,7 @@ int nfs_getattr(struct vfsmount *mnt, struct >> dentry >> *dentry, struct kstat *stat) >> stat->blksize = NFS_SERVER(inode)->dtsize; >> } >> out: >> + nfs_end_io_read(inode); >> trace_nfs_getattr_exit(inode, err); >> return err; >> } >> >> Trond, what do you think? I'll take any additional silence as a sign >> to go >> elsewhere. :P > > No. The above locking excludes all writes as well as O_DIRECT reads… > That’s worse than we had before. > > I’d like rather to understand _why_ the aio_complete() is failing to > work correctly here. According to the analysis of the test case that > you quoted yesterday, the O_DIRECT writes should have completed before > we even call stat(). > > Cheers > Trond The O_DIRECT writes do complete, and every completion signals the other thread to do stat(), but that completion does not update the size on the server. As we know, we need a LAYOUTCOMMIT. After this patch, we're only going to do a LAYOUTCOMMIT if nfs_need_revalidate_inode(inode). So what happens is that the first write completes, and the first nfs_getattr() triggers the first LAYOUTCOMMIT, and then a GETATTR. Simultaneously, the second write is waiting on a LAYOUTGET. The GETATTR completes and sets the size to 4k. Now the attributes are marked up to date with a size of 4k, and when the second write completes, nfs_getattr() is called, nfs_need_revalidate_inode() is false, and we don't bother to send another LAYOUTCOMMIT or GETATTR to correct the cached file size. Here's a function graph of that: http://people.redhat.com/bcodding/borken We could invalidate the attribute cache every time a write completes.. maybe nfs_writeback_update_inode() isn't doing the job for block layouts (are we not setting res.count? I'll look at that..) I think we could also use the LAYOUTCOMMIT results to invalidate the cache, RFC-5661 18.42.3: If the metadata server updates the file's size as the result of the LAYOUTCOMMIT operation, it must return the new size (locr_newsize.ns_size) as part of the results." I'm not sure you want to bother to try to reproduce -- but if so, you don't need special hardware for SCSI layout. I wrote up a quick how-to for SCSI layouts on a VM in qemu: http://people.redhat.com/bcodding/pnfs/nfs/scsi/2016/07/13/pnfs_scsi_setup_for_VMs/ Ben