From: "J. Bruce Fields"
Subject: Re: Link performance over NFS degraded in RHEL5. -- was : Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Date: Mon, 15 Jun 2009 22:02:30 -0400
Message-ID: <20090616020230.GB20223@fieldses.org>
References: <4A2902E6.2080006@RedHat.com>
	<4A29144A.6030405@gmail.com>
	<4A291DE3.2070105@RedHat.com>
	<1244209956.5410.33.camel@heimdal.trondhjem.org>
	<4A29243F.8080008@RedHat.com>
	<20090605160544.GE10975@fieldses.org>
	<1244219715.5410.40.camel@heimdal.trondhjem.org>
	<20090615230801.GA18504@fieldses.org>
	<1245112324.7470.7.camel@heimdal.trondhjem.org>
In-Reply-To: <1245112324.7470.7.camel@heimdal.trondhjem.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: Trond Myklebust
Cc: Steve Dickson, Tom Talpey, Linux NFS Mailing list

On Mon, Jun 15, 2009 at 05:32:04PM -0700, Trond Myklebust wrote:
> On Mon, 2009-06-15 at 19:08 -0400, J. Bruce Fields wrote:
> > On Fri, Jun 05, 2009 at 12:35:15PM -0400, Trond Myklebust wrote:
> > > On Fri, 2009-06-05 at 12:05 -0400, J. Bruce Fields wrote:
> > > > On Fri, Jun 05, 2009 at 09:57:19AM -0400, Steve Dickson wrote:
> > > > > Trond Myklebust wrote:
> > > > > > On Fri, 2009-06-05 at 09:30 -0400, Steve Dickson wrote:
> > > > > >> Tom Talpey wrote:
> > > > > >>> On 6/5/2009 7:35 AM, Steve Dickson wrote:
> > > > > >>>> Brian R Cowan wrote:
> > > > > >>>>> Trond Myklebust wrote on 06/04/2009 02:04:58 PM:
> > > > > >>>>>
> > > > > >>>>>> Did you try turning off write gathering on the server (i.e. add the
> > > > > >>>>>> 'no_wdelay' export option)? As I said earlier, that forces a delay of
> > > > > >>>>>> 10ms per RPC call, which might explain the FILE_SYNC slowness.
> > > > > >>>>> Just tried it, this seems to be a very useful workaround as well. The
> > > > > >>>>> FILE_SYNC write calls come back in about the same amount of time as the
> > > > > >>>>> write+commit pairs... Speeds up building regardless of the network
> > > > > >>>>> filesystem (ClearCase MVFS or straight NFS).
> > > > > >>>> Does anybody have the history as to why 'no_wdelay' is not an
> > > > > >>>> export default?
> > > > > >>> Because "wdelay" is a complete crock?
> > > > > >>>
> > > > > >>> Adding 10ms to every write RPC only helps if there's a steady
> > > > > >>> single-file stream arriving at the server. In most other workloads
> > > > > >>> it only slows things down.
> > > > > >>>
> > > > > >>> The better solution is to continue tuning the clients to issue
> > > > > >>> writes in a more sequential and less all-or-nothing fashion.
> > > > > >>> There are plenty of other less crock-ful things to do in the
> > > > > >>> server, too.
> > > > > >> Ok... So do you think removing it as a default would cause
> > > > > >> any regressions?
> > > > > >
> > > > > > It might for NFSv2 clients, since they don't have the option of using
> > > > > > unstable writes. I'd therefore prefer a kernel solution that makes write
> > > > > > gathering an NFSv2 only feature.
> > > > > Sounds good to me! ;-)
> > > >
> > > > Patch welcomed.--b.
> > >
> > > Something like this ought to suffice...
> >
> > Thanks, applied.
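
Trond's patch isn't quoted here, but the idea, per his comment above, is just
to restrict write gathering to NFSv2. A rough sketch of what that amounts to
(illustrative only, not necessarily the exact change that was applied) is to
key the use_wgather decision in nfsd_vfs_write off the RPC version:

	/*
	 * Sketch only: only NFSv2 writes, which cannot fall back on
	 * unstable writes + COMMIT, remain candidates for the 10ms
	 * write-gathering defer.  rq_vers is the request's RPC version;
	 * EX_WGATHER() tests the export's wdelay flag.
	 */
	use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
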
> >
> > I'd also like to apply cleanup something like the following--there's
> > probably some cleaner way, but it just bothers me to have this
> > write-gathering special case take up the bulk of nfsd_vfs_write....
> >
> > --b.
> >
> > commit bfe7680d68afaf3f0b1195c8976db1fd1f03229d
> > Author: J. Bruce Fields
> > Date:   Mon Jun 15 16:03:53 2009 -0700
> >
> >     nfsd: Pull write-gathering code out of nfsd_vfs_write
> >
> >     This is a relatively self-contained piece of code that handles a special
> >     case--move it to its own function.
> >
> >     Signed-off-by: J. Bruce Fields
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index a8aac7f..de68557 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -963,6 +963,44 @@ static void kill_suid(struct dentry *dentry)
> >  	mutex_unlock(&dentry->d_inode->i_mutex);
> >  }
> >
> > +/*
> > + * Gathered writes: If another process is currently writing to the file,
> > + * there's a high chance this is another nfsd (triggered by a bulk write
> > + * from a client's biod). Rather than syncing the file with each write
> > + * request, we sleep for 10 msec.
> > + *
> > + * I don't know if this roughly approximates C. Juszak's idea of
> > + * gathered writes, but it's a nice and simple solution (IMHO), and it
> > + * seems to work:-)
> > + *
> > + * Note: we do this only in the NFSv2 case, since v3 and higher have a
> > + * better tool (separate unstable writes and commits) for solving this
> > + * problem.
> > + */
> > +static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err)
> > +{
> > +	struct inode *inode = file->f_path.dentry->d_inode;
> > +	static ino_t last_ino;
> > +	static dev_t last_dev;
> > +
> > +	if (!use_wgather)
> > +		goto out;
> > +	if (atomic_read(&inode->i_writecount) > 1
> > +	    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> > +		dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> > +		msleep(10);
> > +		dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> > +	}
> > +
> > +	if (inode->i_state & I_DIRTY) {
> > +		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > +		*host_err = nfsd_sync(file);
> > +	}
> > +out:
> > +	last_ino = inode->i_ino;
> > +	last_dev = inode->i_sb->s_dev;
> > +}
>
> Shouldn't you also timestamp the last_ino/last_dev? Currently you can
> end up waiting even if the last time you referenced this file was 10
> minutes ago...

Maybe, but I don't know that avoiding the delay in the case where
use_wdelay writes are coming rarely is particularly important.  (Note
this is just a single static last_ino/last_dev, so the timestamp would
just tell us how long ago there was last a use_wdelay write.)

I'm not as interested in making wdelay work better--someone who uses v2
and wants to benchmark it can do that--as I am interested in just
getting it out of the way so I don't have to look at it again....

--b.

>
> > +
> >  static __be32
> >  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> >  				loff_t offset, struct kvec *vec, int vlen,
> > @@ -1025,41 +1063,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
> >  	if (host_err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID)))
> >  		kill_suid(dentry);
> >
> > -	if (host_err >= 0 && stable) {
> > -		static ino_t	last_ino;
> > -		static dev_t	last_dev;
> > -
> > -		/*
> > -		 * Gathered writes: If another process is currently
> > -		 * writing to the file, there's a high chance
> > -		 * this is another nfsd (triggered by a bulk write
> > -		 * from a client's biod). Rather than syncing the
> > -		 * file with each write request, we sleep for 10 msec.
> > -		 *
> > -		 * I don't know if this roughly approximates
> > -		 * C. Juszak's idea of gathered writes, but it's a
> > -		 * nice and simple solution (IMHO), and it seems to
> > -		 * work:-)
> > -		 */
> > -		if (use_wgather) {
> > -			if (atomic_read(&inode->i_writecount) > 1
> > -			    || (last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
> > -				dprintk("nfsd: write defer %d\n", task_pid_nr(current));
> > -				msleep(10);
> > -				dprintk("nfsd: write resume %d\n", task_pid_nr(current));
> > -			}
> > -
> > -			if (inode->i_state & I_DIRTY) {
> > -				dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > -				host_err=nfsd_sync(file);
> > -			}
> > -#if 0
> > -			wake_up(&inode->i_wait);
> > -#endif
> > -		}
> > -		last_ino = inode->i_ino;
> > -		last_dev = inode->i_sb->s_dev;
> > -	}
> > +	if (host_err >= 0 && stable)
> > +		wait_for_concurrent_writes(file, use_wgather, &host_err);
> >
> >  	dprintk("nfsd: write complete host_err=%d\n", host_err);
> >  	if (host_err >= 0) {
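
For illustration only, Trond's timestamp suggestion layered on the helper
above might look roughly like the following. This is a sketch, not applied
code, and the ~100ms window is an arbitrary value chosen for the example:

/*
 * Sketch of a timestamped variant: remember when the last gathered
 * write happened, and only trust last_ino/last_dev if that was recent.
 */
static void wait_for_concurrent_writes(struct file *file, int use_wgather, int *host_err)
{
	struct inode *inode = file->f_path.dentry->d_inode;
	static ino_t last_ino;
	static dev_t last_dev;
	static unsigned long last_time;	/* jiffies of the last gathered write */

	if (!use_wgather)
		goto out;
	if (atomic_read(&inode->i_writecount) > 1
	    || (time_before(jiffies, last_time + HZ / 10)
		&& last_ino == inode->i_ino && last_dev == inode->i_sb->s_dev)) {
		dprintk("nfsd: write defer %d\n", task_pid_nr(current));
		msleep(10);
		dprintk("nfsd: write resume %d\n", task_pid_nr(current));
	}

	if (inode->i_state & I_DIRTY) {
		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
		*host_err = nfsd_sync(file);
	}
out:
	last_ino = inode->i_ino;
	last_dev = inode->i_sb->s_dev;
	last_time = jiffies;
}

As noted above, though, this only matters for NFSv2 clients issuing occasional
stable writes, so it may not be worth the extra state.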