From: Chuck Lever
Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Date: Fri, 29 May 2009 13:01:38 -0400
Message-ID: <41044976-395B-4ED0-BBA1-153FD76BDA53@oracle.com>
References: <5ECD2205-4DC9-41F1-AC5C-ADFA984745D3@oracle.com> <49FA0CE8.9090706@redhat.com> <1241126587.15476.62.camel@heimdal.trondhjem.org>
Mime-Version: 1.0 (Apple Message framework v935.3)
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Cc: Trond Myklebust, linux-nfs@vger.kernel.org, linux-nfs-owner@vger.kernel.org, Peter Staubach
To: Brian R Cowan
Return-path:
Received: from rcsinet12.oracle.com ([148.87.113.124]:45134 "EHLO rgminet12.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753213AbZE2RCH (ORCPT ); Fri, 29 May 2009 13:02:07 -0400
In-Reply-To:
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

On May 29, 2009, at 11:55 AM, Brian R Cowan wrote:

> Been working this issue with Red Hat, and didn't need to go to the
> list... Well, now I do... You mention that "The main type of workload
> we're targeting with this patch is the app that opens a file, writes
> < 4k and then closes the file." Well, it appears that this issue also
> impacts flushing pages from filesystem caches.
>
> The reason this came up in my environment is that our product's build
> auditing gives the filesystem cache an interesting workout. When
> ClearCase audits a build, the build places data in a few places,
> including:
> 1) a build audit file that usually resides in /tmp. This build audit
> is essentially a log of EVERY file open/read/write/delete/rename/etc.
> that the programs called in the build script make in the clearcase
> "view" you're building in. As a result, this file can get pretty
> large.
> 2) The build outputs themselves, which in this case are being written
> to a remote storage location on a Linux or Solaris server, and
> 3) a file called .cmake.state, which is a local cache that is written
> to after the build script completes containing what is essentially a
> "Bill of materials" for the files created during builds in this
> "view."
>
> We believe that the build audit file access is causing build output
> to get flushed out of the filesystem cache. These flushes happen *in
> 4k chunks.* This trips over this change since the cache pages appear
> to get flushed on an individual basis.

So, are you saying that the application is flushing after every 4KB
write(2), or that the application has written a bunch of pages, and
VM/VFS on the client is doing the synchronous page flushes? If it's
the application doing this, then you really do not want to mitigate
this by defeating the STABLE writes -- the application must have some
requirement that the data is permanent.

Unless I have misunderstood something, the previous faster behavior
was due to cheating, and put your data at risk. I can't see how
replacing an UNSTABLE + COMMIT with a single FILE_SYNC write would
cause such a significant performance impact.

> One note is that if the build outputs were going to a clearcase view
> stored on an enterprise-level NAS device, there isn't as much of an
> issue because many of these return from the stable write request as
> soon as the data goes into the battery-backed memory disk cache on
> the NAS. However, it really impacts writes to general-purpose OS's
> that follow Sun's lead in how they handle "stable" writes. The truly
> annoying part about this rather subtle change is that the NFS client
> is specifically ignoring the client mount options since we cannot
> force the "async" mount option to turn off this behavior.

You may have a misunderstanding about what exactly "async" does.
The "sync" / "async" mount options control only whether the
application waits for the data to be flushed to permanent storage.
On any file system I know of, they have no effect on _how_
specifically the data is moved from the page cache to permanent
storage.

> =================================================================
> Brian Cowan
> Advisory Software Engineer
> ClearCase Customer Advocacy Group (CAG)
> Rational Software
> IBM Software Group
> 81 Hartwell Ave
> Lexington, MA
>
> Phone: 1.781.372.3580
> Web: http://www.ibm.com/software/rational/support/
>
> Please be sure to update your PMR using ESR at
> http://www-306.ibm.com/software/support/probsub.html or cc all
> correspondence to sw_support@us.ibm.com to be sure your PMR is
> updated in case I am not available.
>
> From: Trond Myklebust
> To: Peter Staubach
> Cc: Chuck Lever, Brian R Cowan/Cupertino/IBM@IBMUS,
> linux-nfs@vger.kernel.org
> Date: 04/30/2009 05:23 PM
> Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE
> page flushing
> Sent by: linux-nfs-owner@vger.kernel.org
>
> On Thu, 2009-04-30 at 16:41 -0400, Peter Staubach wrote:
>> Chuck Lever wrote:
>>>
>>> On Apr 30, 2009, at 4:12 PM, Brian R Cowan wrote:
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ab0a3dbedc51037f3d2e22ef67717a987b3d15e2
>>>>
>> Actually, the "stable" part can be a killer. It depends upon
>> why and when nfs_flush_inode() is invoked.
>>
>> I did quite a bit of work on this aspect of RHEL-5 and discovered
>> that this particular code was leading to some serious slowdowns.
>> The server would end up doing a very slow FILE_SYNC write when
>> all that was really required was an UNSTABLE write at the time.
>>
>> Did anyone actually measure this optimization and if so, what
>> were the numbers?
>
> As usual, the optimisation is workload dependent.
> The main type of workload we're targeting with this patch is the app
> that opens a file, writes < 4k and then closes the file. For that
> case, it's a no-brainer that you don't need to split a single stable
> write into an unstable + a commit.
>
> So if the application isn't doing the above type of short write
> followed by close, then exactly what is causing a flush to disk in
> the first place? Ordinarily, the client will try to cache writes
> until the cows come home (or until the VM tells it to reclaim memory
> - whichever comes first)...
>
> Cheers
> Trond
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs"
> in the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
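
For anyone who wants to reproduce the workload Trond describes in the
quoted text (open a file, write less than 4k, close it), here is a
minimal sketch in Python. It is not taken from the thread: the function
names small_write_then_close and time_small_writes are hypothetical,
and the directory you pass in is assumed to sit on the NFS mount under
test. The script only issues the system calls; whether the client then
sends a single FILE_SYNC write or an UNSTABLE write plus a COMMIT is
decided in the kernel and is not visible from here.

```python
import os
import tempfile
import time


def small_write_then_close(path, payload):
    """Open a file, write one short buffer, and close it again."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = os.write(fd, payload)
    finally:
        # On an NFS mount the Linux client flushes dirty data at close.
        os.close(fd)
    return written


def time_small_writes(directory, count=100, size=1024):
    """Return the elapsed seconds for `count` open/write/close cycles."""
    payload = b"x" * size  # deliberately smaller than one 4k page
    start = time.perf_counter()
    for i in range(count):
        small_write_then_close(os.path.join(directory, f"f{i}"), payload)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Assumption: replace the temporary directory with a path on the
    # NFS mount you want to measure.
    with tempfile.TemporaryDirectory() as scratch:
        elapsed = time_small_writes(scratch)
        print(f"100 small writes took {elapsed:.4f} s")
```

Running the same script against a local directory and against the NFS
mount, on kernels with and without the change, gives a rough comparison
for this one workload only. It says nothing about the page-reclaim case
discussed above.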