From: Steve Rago
Subject: Re: [PATCH] improve the performance of large sequential write NFS workloads
Date: Sat, 19 Dec 2009 09:25:47 -0500
Message-ID: <1261232747.1947.194.camel@serenity>
References: <1261015420.1947.54.camel@serenity>
	 <1261037877.27920.36.camel@laptop>
	 <20091219122033.GA11360@localhost>
Mime-Version: 1.0
Content-Type: text/plain
Cc: Peter Zijlstra, "linux-nfs@vger.kernel.org",
	"linux-kernel@vger.kernel.org", "Trond.Myklebust@netapp.com",
	"jens.axboe", Peter Staubach
To: Wu Fengguang
In-Reply-To: <20091219122033.GA11360@localhost>
Sender: linux-kernel-owner@vger.kernel.org

On Sat, 2009-12-19 at 20:20 +0800, Wu Fengguang wrote:
> Hi Steve,
>
> // I should really read the NFS code, but maybe you can help us better
> // understand the problem :)
>
> On Thu, Dec 17, 2009 at 04:17:57PM +0800, Peter Zijlstra wrote:
> > On Wed, 2009-12-16 at 21:03 -0500, Steve Rago wrote:
> > > Eager Writeback for NFS Clients
> > > -------------------------------
> > > Prevent applications that write large sequential streams of data
> > > (like backup, for example) from entering into a memory pressure
> > > state, which degrades performance by falling back to synchronous
> > > operations (both synchronous writes and additional commits).
>
> What exactly is the "memory pressure state" condition?  What's the
> code to do the "synchronous writes and additional commits" and maybe
> how they are triggered?

Memory pressure occurs when most of the client pages have been dirtied
by an application (think of a backup server writing multi-gigabyte
files that exceed the size of main memory).  The system works harder
to free dirty pages so that they can be reused.  For a local file
system, this means writing the pages to disk.  For NFS, however, the
writes leave the pages in an "unstable" state until the server
responds to a commit request.
Generally speaking, commit processing is far more expensive than write
processing on the server; both are done with the inode locked, but
since the commit takes so long, all writes are blocked, which stalls
the pipeline.

Flushing is generated from the flush kernel threads (nee pdflush) and
from the applications themselves (balance_dirty_pages), as well as
periodic sync (kupdate style).  This is roughly controlled by
adjusting dirty_ratio and dirty_background_ratio (along with
dirty_expire_centisecs and dirty_writeback_centisecs).

In addition, when the client system *really* needs a page deep down in
the page allocator, it can generate a synchronous write request for
individual pages.  This is just about as expensive as a commit,
roughly speaking, again stalling the pipeline.

> > > This is accomplished by preventing the client application from
> > > dirtying pages faster than they can be written to the server:
> > > clients write pages eagerly instead of lazily.
>
> We already have the balance_dirty_pages() based global throttling.
> So what makes the performance difference in your proposed "per-inode"
> throttling?  balance_dirty_pages() does have much larger threshold
> than yours.

I originally spent several months playing with the balance_dirty_pages
algorithm.  The main drawbacks are that it affects more than the
inodes that the caller is writing and that the control of what to do
is too coarse.  My final changes (which worked well for 1Gb
connections) were more heuristic than the changes in the patch -- I
basically had to come up with alternate ways to write pages without
generating commits on inodes.  Doing this was distasteful, as I was
adjusting generic system behavior for an NFS-only problem.  Then a
colleague found Peter Staubach's patch, which worked just as well in
less code, and isolated the change to the NFS component, which is
where it belongs.
> > > The eager writeback is controlled by a sysctl:
> > > fs.nfs.nfs_max_woutstanding set to 0 disables the feature.
> > > Otherwise it contains the maximum number of outstanding NFS
> > > writes that can be in flight for a given file.  This is used to
> > > block the application from dirtying more pages until the writes
> > > are complete.
>
> What if we do heuristic write-behind for sequential NFS writes?

Part of the patch does implement a heuristic write-behind.  See where
nfs_wb_eager() is called.

> Another related proposal from Peter Staubach is to start async
> writeback (without the throttle in your proposal) when one inode has
> enough pages dirtied:
>
>         Another approach that I suggested was to keep track of the
>         number of pages which are dirty on a per-inode basis.  When
>         enough pages are dirty to fill an over the wire transfer,
>         then schedule an asynchronous write to transmit that data to
>         the server.  This ties in with support to ensure that the
>         server/network is not completely overwhelmed by the client
>         by flow controlling the writing application to better match
>         the bandwidth and latencies of the network and server.
>         With this support, the NFS client tends not to fill memory
>         with dirty pages and thus, does not depend upon the other
>         parts of the system to flush these pages.
>
> Can the above alternatives fix the same problem?  (or perhaps, is the
> per-inode throttling really necessary?)

This alternative *is contained in* the patch (as this is mostly
Peter's code anyway; all I've done is the analysis and porting).  The
throttling is necessary to prevent the client from continuing to fill
all of its memory with dirty pages, which it can always do faster than
it can write pages to the server.

> > > This patch is based heavily (okay, almost entirely) on a prior
> > > patch by Peter Staubach.  For the original patch, see
> > > http://article.gmane.org/gmane.linux.nfs/24323.
> > > The patch below applies to linux-2.6.32-rc7, but it should apply
> > > cleanly to vanilla linux-2.6.32.
> > >
> > > Performance data and tuning notes can be found on my web site
> > > (http://www.nec-labs.com/~sar).  With iozone, I see about 50%
> > > improvement for large sequential write workloads over a 1Gb
> > > Ethernet.  With an in-house micro-benchmark, I see 80%
> > > improvement for large, single-stream, sequential workloads
> > > (where "large" is defined to be greater than the memory size on
> > > the client).
>
> These are impressive numbers.  I wonder what would be the minimal
> patch (just hacking it to be fast, without all the aux bits)?  Is it
> this chunk to call nfs_wb_eager()?

The first half is the same as before, with different indentation.  The
last half is indeed the heuristic that calls nfs_wb_eager() to invoke
asynchronous write-behind.  With these changes, the number of NFS
commit messages drops from a few thousand to tens when writing a 32GB
file over NFS.  This is mainly because the server is writing dirty
pages from its cache in the background, so when a commit comes along,
it has less work to do (as opposed to writing all of the pages on
demand and then waiting for the commit).

I have a second set of changes, which I have not yet submitted, that
removes these commits, but it extends the protocol (in a
backward-compatible way), which will probably be a harder sell.
Steve

>
> > > @@ -623,10 +635,21 @@ static ssize_t nfs_file_write(struct kio
> > >  	nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, count);
> > >  	result = generic_file_aio_write(iocb, iov, nr_segs, pos);
> > >  	/* Return error values for O_SYNC and IS_SYNC() */
> > > -	if (result >= 0 && nfs_need_sync_write(iocb->ki_filp, inode)) {
> > > -		int err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp), inode);
> > > -		if (err < 0)
> > > -			result = err;
> > > +	if (result >= 0) {
> > > +		if (nfs_need_sync_write(iocb->ki_filp, inode)) {
> > > +			int err;
> > > +
> > > +			err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp),
> > > +					inode);
> > > +			if (err < 0)
> > > +				result = err;
> > > +		} else if (nfs_max_woutstanding != 0 &&
> > > +			nfs_is_seqwrite(inode, pos) &&
> > > +			atomic_read(&nfsi->ndirty) >= NFS_SERVER(inode)->wpages) {
> > > +			nfs_wb_eager(inode);
> > > +		}
> > > +		if (result > 0)
> > > +			nfsi->wrpos = pos + result;
> > >  	}
>
> Thanks,
> Fengguang