From: Chuck Lever
Subject: Re: [PATCH] flow control for WRITE requests
Date: Fri, 23 Jan 2009 16:13:23 -0500
Message-ID: <25989A0A-2E69-4761-B6DA-AE1938B1792E@oracle.com>
In-Reply-To: <497A2FED.4080407@redhat.com>
To: Peter Staubach
Cc: Trond Myklebust, NFS list

On Jan 23, 2009, at 4:00 PM, Peter Staubach wrote:

> Chuck Lever wrote:
>> How does this play with memory pressure?
>>
>> I would expect the generic background page flushers, like pdflush
>> and kswapd, could still cause out-of-order writes.
>>
>
> I believe that they all end up calling into the NFS client
> code via the nfs_writepages() interface with the writeback
> control struct set to write out pages sequentially, possibly
> limiting the number of pages to be written, but usually not.
>
> I ran this code on a 1GB system, writing out 2GB+ files, and
> the only out-of-order issues that I saw were generated by
> pdflush and the writing application itself intermixing the
> generation of WRITE requests. This was addressed by the
> introduction of NFS_INO_FLUSHING and the code in
> nfs_writepages() which uses it.
>
> I also ran the code on a 4GB system with the changes,
> writing files of the same size, and no out-of-order WRITE
> requests were seen.
>
> I believe that Nick Piggin solved a bunch of these issues
> recently, and the NFS_INO_FLUSHING patch along with his
> patches seems to have resolved them.

That's good news. How about ^C -- any weirdness there?
Pages left pinned on the client, fractured writes on the server, application hangs?

Have you tried this patch with a multi-threaded, write-intensive async I/O workload? I'm sure you know which one I'm referring to. ;-)

Would it make sense to key the new nfs_max_outstanding_writes sysctl to the RPC slot table size for that mount point? Another tunable... that doesn't self-tune... Ugh. How did you arrive at the default value of 16?

CITI had been trying to improve write performance on transatlantic 10Gb links. It would be useful to test in that environment.

> Thanx...
>
> ps
>
>> On Jan 23, 2009, at 2:10 PM, Peter Staubach wrote:
>>
>>> Hi.
>>>
>>> Attached is a patch which implements some flow control for the
>>> NFS client to control dirty pages. The flow control is
>>> implemented on a per-file basis and causes dirty pages to be
>>> written out when the client can detect that the application is
>>> writing in a serial fashion and has dirtied enough pages to
>>> fill a complete over-the-wire transfer.
>>>
>>> This work was precipitated by a situation in which a server
>>> at a customer site was not able to adequately handle the
>>> behavior of the Linux NFS client. This particular server
>>> required that all data written to the file be written in a
>>> strictly serial fashion. It also had problems handling the
>>> Linux NFS client semantic of caching a large amount of data
>>> and then sending out that data all at once.
>>>
>>> The sequential ordering problem was resolved by a previous
>>> patch which was submitted to the linux-nfs list. This patch
>>> addresses the capacity problem.
>>>
>>> The problem is resolved by sending WRITE requests much
>>> earlier in the process of the application writing to the file.
>>> The client keeps track of the number of dirty pages associated
>>> with the file and also the last offset of the data being
>>> written.
>>> When the client detects that a full over-the-wire
>>> transfer could be constructed and that the application is
>>> writing sequentially, it generates an UNSTABLE write to
>>> the server for the currently dirty data.
>>>
>>> The client also keeps track of the number of these WRITE
>>> requests which have been generated, and flow-controls based
>>> on a configurable maximum. This keeps the client from
>>> completely overwhelming the server.
>>>
>>> A nice side effect of the framework is that stat()'ing a
>>> file being written can be handled much more quickly than
>>> before. The amount of data that must be transmitted to the
>>> server to satisfy the "latest mtime" requirement is limited.
>>> Also, the application writing to the file is blocked until
>>> the over-the-wire GETATTR is completed. This allows the
>>> GETATTR to be sent and the response received without
>>> competing with the data being written.
>>>
>>> I did not do formal performance testing, but while watching
>>> my testing, I did not see any performance regressions.
>>>
>>> As a side note -- the more natural model of flow control
>>> would seem to be at the client/server level instead of the
>>> per-file level. However, that level was too coarse for the
>>> particular server that had to be used, because its
>>> requirements were at the per-file level.
>>>
>>> Thanx...
>>>
>>> ps
>>>
>>> Signed-off-by: Peter Staubach
>>>
>>> --- linux-2.6.28.i686/fs/nfs/inode.c.org
>>> +++ linux-2.6.28.i686/fs/nfs/inode.c
>>> @@ -486,8 +486,10 @@ void nfs_setattr_update_inode(struct ino
>>>  int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
>>>  		struct kstat *stat)
>>>  {
>>>  	struct inode *inode = dentry->d_inode;
>>> -	int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
>>> +	struct nfs_inode *nfsi = NFS_I(inode);
>>> +	int need_atime = nfsi->cache_validity & NFS_INO_INVALID_ATIME;
>>>  	int err;
>>> +	int inc_outstanding_writes = nfs_max_outstanding_writes;
>>>
>>>  	/*
>>>  	 * Flush out writes to the server in order to update c/mtime.
>>> @@ -497,9 +499,8 @@ int nfs_getattr(struct vfsmount *mnt, st
>>>  	 * nfs_wb_nocommit.
>>>  	 */
>>>  	if (S_ISREG(inode->i_mode)) {
>>> -		mutex_lock(&inode->i_mutex);
>>> +		atomic_add(inc_outstanding_writes, &nfsi->writes);
>>>  		nfs_wb_nocommit(inode);
>>> -		mutex_unlock(&inode->i_mutex);
>>>  	}
>>>
>>>  	/*
>>> @@ -523,6 +524,10 @@ int nfs_getattr(struct vfsmount *mnt, st
>>>  		generic_fillattr(inode, stat);
>>>  		stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
>>>  	}
>>> +	if (S_ISREG(inode->i_mode)) {
>>> +		atomic_sub(inc_outstanding_writes, &nfsi->writes);
>>> +		wake_up(&nfsi->writes_wq);
>>> +	}
>>>  	return err;
>>>  }
>>>
>>> @@ -1288,9 +1293,13 @@ static void init_once(void *foo)
>>>  	INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC);
>>>  	nfsi->ncommit = 0;
>>>  	nfsi->npages = 0;
>>> +	atomic_set(&nfsi->ndirty, 0);
>>>  	atomic_set(&nfsi->silly_count, 1);
>>>  	INIT_HLIST_HEAD(&nfsi->silly_list);
>>>  	init_waitqueue_head(&nfsi->waitqueue);
>>> +	atomic_set(&nfsi->writes, 0);
>>> +	init_waitqueue_head(&nfsi->writes_wq);
>>> +	nfsi->wrpos = 0;
>>>  	nfs4_init_once(nfsi);
>>>  }
>>>
>>> --- linux-2.6.28.i686/fs/nfs/write.c.org
>>> +++ linux-2.6.28.i686/fs/nfs/write.c
>>> @@ -197,7 +197,9 @@ static int nfs_set_page_writeback(struct
>>>  	if (!ret) {
>>>  		struct inode *inode = page->mapping->host;
>>>  		struct nfs_server *nfss = NFS_SERVER(inode);
>>> +		struct nfs_inode *nfsi = NFS_I(inode);
>>>
>>> +		atomic_dec(&nfsi->ndirty);
>>>  		if (atomic_long_inc_return(&nfss->writeback) >
>>>  				NFS_CONGESTION_ON_THRESH)
>>>  			set_bdi_congested(&nfss->backing_dev_info, WRITE);
>>> @@ -310,6 +312,33 @@ static int nfs_writepages_callback(struc
>>>  	return ret;
>>>  }
>>>
>>> +int nfs_max_outstanding_writes = 16;
>>> +
>>> +static void nfs_inc_outstanding_writes(struct inode *inode)
>>> +{
>>> +	struct nfs_inode *nfsi = NFS_I(inode);
>>> +
>>> +	atomic_inc(&nfsi->writes);
>>> +}
>>> +
>>> +static void nfs_dec_outstanding_writes(struct inode *inode)
>>> +{
>>> +	struct nfs_inode *nfsi = NFS_I(inode);
>>> +
>>> +	if (atomic_dec_return(&nfsi->writes) < nfs_max_outstanding_writes)
>>> +		wake_up(&nfsi->writes_wq);
>>> +}
>>> +
>>> +void nfs_wait_for_outstanding_writes(struct inode *inode)
>>> +{
>>> +	struct nfs_inode *nfsi = NFS_I(inode);
>>> +
>>> +	if (nfs_max_outstanding_writes) {
>>> +		wait_event(nfsi->writes_wq,
>>> +			atomic_read(&nfsi->writes) < nfs_max_outstanding_writes);
>>> +	}
>>> +}
>>> +
>>>  int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
>>>  {
>>>  	struct inode *inode = mapping->host;
>>> @@ -369,6 +398,7 @@ static int nfs_inode_add_request(struct
>>>  	SetPagePrivate(req->wb_page);
>>>  	set_page_private(req->wb_page, (unsigned long)req);
>>>  	nfsi->npages++;
>>> +	atomic_inc(&nfsi->ndirty);
>>>  	kref_get(&req->wb_kref);
>>>  	radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index,
>>>  			NFS_PAGE_TAG_LOCKED);
>>> @@ -405,6 +435,10 @@ static void nfs_inode_remove_request(str
>>>  static void
>>>  nfs_mark_request_dirty(struct nfs_page *req)
>>>  {
>>> +	struct inode *inode = req->wb_context->path.dentry->d_inode;
>>> +	struct nfs_inode *nfsi = NFS_I(inode);
>>> +
>>> +	atomic_inc(&nfsi->ndirty);
>>>  	__set_page_dirty_nobuffers(req->wb_page);
>>>  }
>>>
>>> @@ -633,6 +667,7 @@ static struct nfs_page *nfs_try_to_updat
>>>  		req->wb_bytes = end - req->wb_offset;
>>>  	else
>>>  		req->wb_bytes = rqend - req->wb_offset;
>>> +	atomic_inc(&NFS_I(inode)->ndirty);
>>>  out_unlock:
>>>  	spin_unlock(&inode->i_lock);
>>>  	return req;
>>> @@ -855,6 +890,8 @@ static int nfs_write_rpcsetup(struct nfs
>>>  		count,
>>>  		(unsigned long long)data->args.offset);
>>>
>>> +	nfs_inc_outstanding_writes(inode);
>>> +
>>>  	task = rpc_run_task(&task_setup_data);
>>>  	if (IS_ERR(task))
>>>  		return PTR_ERR(task);
>>> @@ -1130,7 +1167,7 @@ int nfs_writeback_done(struct rpc_task *
>>>  	 */
>>>  	status = NFS_PROTO(data->inode)->write_done(task, data);
>>>  	if (status != 0)
>>> -		return status;
>>> +		goto out;
>>>  	nfs_add_stats(data->inode, NFSIOS_SERVERWRITTENBYTES, resp->count);
>>>
>>>  #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
>>> @@ -1186,7 +1223,9 @@ int nfs_writeback_done(struct rpc_task *
>>>  		/* Can't do anything about it except throw an error. */
>>>  		task->tk_status = -EIO;
>>>  	}
>>> -	return 0;
>>> +out:
>>> +	nfs_dec_outstanding_writes(data->inode);
>>> +	return status;
>>>  }
>>>
>>>
>>> @@ -1546,6 +1585,29 @@ int nfs_wb_page(struct inode *inode, str
>>>  	return nfs_wb_page_priority(inode, page, FLUSH_STABLE);
>>>  }
>>>
>>> +/*
>>> + * Start the WRITE requests for dirty pages on their way.
>>> + * This is used when a sufficient number of dirty pages
>>> + * have accumulated.
>>> + */
>>> +int nfs_wb_interim(struct inode *inode)
>>> +{
>>> +	struct address_space *mapping = inode->i_mapping;
>>> +	struct writeback_control wbc = {
>>> +		.bdi = mapping->backing_dev_info,
>>> +		.sync_mode = WB_SYNC_NONE,
>>> +		.nr_to_write = LONG_MAX,
>>> +		.range_start = 0,
>>> +		.range_end = LLONG_MAX,
>>> +	};
>>> +	int ret;
>>> +
>>> +	ret = nfs_writepages(mapping, &wbc);
>>> +	if (ret < 0)
>>> +		__mark_inode_dirty(inode, I_DIRTY_PAGES);
>>> +	return ret;
>>> +}
>>> +
>>>  int __init nfs_init_writepagecache(void)
>>>  {
>>>  	nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
>>>
>>> --- linux-2.6.28.i686/fs/nfs/sysctl.c.org
>>> +++ linux-2.6.28.i686/fs/nfs/sysctl.c
>>> @@ -58,6 +58,14 @@ static ctl_table nfs_cb_sysctls[] = {
>>>  		.mode = 0644,
>>>  		.proc_handler = &proc_dointvec,
>>>  	},
>>> +	{
>>> +		.ctl_name = CTL_UNNUMBERED,
>>> +		.procname = "nfs_max_outstanding_writes",
>>> +		.data = &nfs_max_outstanding_writes,
>>> +		.maxlen = sizeof(int),
>>> +		.mode = 0644,
>>> +		.proc_handler = &proc_dointvec,
>>> +	},
>>>  	{ .ctl_name = 0 }
>>>  };
>>>
>>> --- linux-2.6.28.i686/fs/nfs/file.c.org
>>> +++ linux-2.6.28.i686/fs/nfs/file.c
>>> @@ -512,11 +512,17 @@ static int nfs_need_sync_write(struct fi
>>>  	return 0;
>>>  }
>>>
>>> +static int nfs_is_serial(struct inode *inode, loff_t pos)
>>> +{
>>> +	return NFS_I(inode)->wrpos == pos;
>>> +}
>>> +
>>>  static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
>>>  				unsigned long nr_segs, loff_t pos)
>>>  {
>>> -	struct dentry * dentry = iocb->ki_filp->f_path.dentry;
>>> -	struct inode * inode = dentry->d_inode;
>>> +	struct dentry *dentry = iocb->ki_filp->f_path.dentry;
>>> +	struct inode *inode = dentry->d_inode;
>>> +	struct nfs_inode *nfsi = NFS_I(inode);
>>>  	ssize_t result;
>>>  	size_t count = iov_length(iov, nr_segs);
>>>
>>> @@ -530,6 +536,13 @@ static ssize_t nfs_file_write(struct kio
>>>  	result = -EBUSY;
>>>  	if (IS_SWAPFILE(inode))
>>>  		goto out_swapfile;
>>> +
>>> +	result = count;
>>> +	if (!count)
>>> +		goto out;
>>> +
>>> +	nfs_wait_for_outstanding_writes(inode);
>>> +
>>>  	/*
>>>  	 * O_APPEND implies that we must revalidate the file length.
>>>  	 */
>>> @@ -539,17 +552,22 @@ static ssize_t nfs_file_write(struct kio
>>>  		goto out;
>>>  	}
>>>
>>> -	result = count;
>>> -	if (!count)
>>> -		goto out;
>>> -
>>>  	nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, count);
>>>  	result = generic_file_aio_write(iocb, iov, nr_segs, pos);
>>>  	/* Return error values for O_SYNC and IS_SYNC() */
>>> -	if (result >= 0 && nfs_need_sync_write(iocb->ki_filp, inode)) {
>>> -		int err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp), inode);
>>> -		if (err < 0)
>>> -			result = err;
>>> +	if (result >= 0) {
>>> +		if (nfs_need_sync_write(iocb->ki_filp, inode)) {
>>> +			int err;
>>> +			err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp),
>>> +					inode);
>>> +			if (err < 0)
>>> +				result = err;
>>> +		} else if (nfs_max_outstanding_writes &&
>>> +				nfs_is_serial(inode, pos) &&
>>> +				atomic_read(&nfsi->ndirty) >= NFS_SERVER(inode)->wpages)
>>> +			nfs_wb_interim(inode);
>>> +		if (result > 0)
>>> +			nfsi->wrpos = pos + result;
>>>  	}
>>>  out:
>>>  	return result;
>>>
>>> --- linux-2.6.28.i686/include/linux/nfs_fs.h.org
>>> +++ linux-2.6.28.i686/include/linux/nfs_fs.h
>>> @@ -168,6 +168,7 @@ struct nfs_inode {
>>>
>>>  	unsigned long		ncommit,
>>>  				npages;
>>> +	atomic_t		ndirty;
>>>
>>>  	/* Open contexts for shared mmap writes */
>>>  	struct list_head	open_files;
>>> @@ -186,6 +187,9 @@ struct nfs_inode {
>>>  	fmode_t			delegation_state;
>>>  	struct rw_semaphore	rwsem;
>>>  #endif /* CONFIG_NFS_V4*/
>>> +	atomic_t		writes;		/* number of outstanding WRITEs */
>>> +	wait_queue_head_t	writes_wq;
>>> +	loff_t			wrpos;		/* position after last WRITE */
>>>  	struct inode		vfs_inode;
>>>  };
>>>
>>> @@ -459,12 +463,14 @@ extern void nfs_unblock_sillyrename(stru
>>>   * linux/fs/nfs/write.c
>>>   */
>>>  extern int nfs_congestion_kb;
>>> +extern int nfs_max_outstanding_writes;
>>>  extern int nfs_writepage(struct page *page, struct writeback_control *wbc);
>>>  extern int nfs_writepages(struct address_space *, struct writeback_control *);
>>>  extern int nfs_flush_incompatible(struct file *file, struct page *page);
>>>  extern int nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
>>> -extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
>>> +extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
>>>  extern void nfs_writedata_release(void *);
>>> +extern void nfs_wait_for_outstanding_writes(struct inode *);
>>>
>>>  /*
>>>   * Try to write back everything synchronously (but check the
>>> @@ -475,6 +481,7 @@ extern int nfs_wb_all(struct inode *inod
>>>  extern int nfs_wb_nocommit(struct inode *inode);
>>>  extern int nfs_wb_page(struct inode *inode, struct page* page);
>>>  extern int nfs_wb_page_cancel(struct inode *inode, struct page* page);
>>> +extern int nfs_wb_interim(struct inode *);
>>>  #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
>>>  extern int nfs_commit_inode(struct inode *, int);
>>>  extern struct nfs_write_data *nfs_commitdata_alloc(void);

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com