Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932327Ab0AFQ5Q (ORCPT ); Wed, 6 Jan 2010 11:57:16 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932226Ab0AFQ5O (ORCPT ); Wed, 6 Jan 2010 11:57:14 -0500 Received: from mx2.netapp.com ([216.240.18.37]:55246 "EHLO mx2.netapp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932306Ab0AFQ5M convert rfc822-to-8bit (ORCPT ); Wed, 6 Jan 2010 11:57:12 -0500 X-IronPort-AV: E=Sophos;i="4.49,230,1262592000"; d="scan'208";a="298022884" Subject: Re: [PATCH] improve the performance of large sequential write NFS workloads From: Trond Myklebust To: Wu Fengguang Cc: Jan Kara , Steve Rago , Peter Zijlstra , "linux-nfs@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "jens.axboe" , Peter Staubach , Arjan van de Ven , Ingo Molnar , "linux-fsdevel@vger.kernel.org" In-Reply-To: <20100106030346.GA15962@localhost> References: <20091222015907.GA6223@localhost> <1261578107.2606.11.camel@localhost> <20091223180551.GD3159@quack.suse.cz> <1261595574.6775.2.camel@localhost> <20091224025228.GA12477@localhost> <1261656281.3596.1.camel@localhost> <20091225055617.GA8595@localhost> <1262190168.7332.6.camel@localhost> <20091231050441.GB19627@localhost> <1262286828.8151.113.camel@localhost> <20100106030346.GA15962@localhost> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8BIT Organization: NetApp Date: Wed, 06 Jan 2010 11:56:02 -0500 Message-ID: <1262796962.4251.91.camel@localhost> Mime-Version: 1.0 X-Mailer: Evolution 2.28.2 (2.28.2-1.fc12) X-OriginalArrivalTime: 06 Jan 2010 16:56:51.0247 (UTC) FILETIME=[3E083BF0:01CA8EF1] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4234 Lines: 116 On Wed, 2010-01-06 at 11:03 +0800, Wu Fengguang wrote: > Trond, > > On Fri, Jan 01, 2010 at 03:13:48AM +0800, Trond Myklebust wrote: > > The above change improves on the existing code, but doesn't solve the > > problem that write_inode() isn't a good match for COMMIT. We need to > > wait for all the unstable WRITE rpc calls to return before we can know > > whether or not a COMMIT is needed (some commercial servers never require > > commit, even if the client requested an unstable write). That was the > > other reason for the change. > > Ah good to know that reason. However we cannot wait for ongoing WRITEs > for unlimited time or pages, otherwise nr_unstable goes up and squeeze > nr_dirty and nr_writeback to zero, and stall the cp process for a long > time, as demonstrated by the trace (more reasoning in previous email). OK. I think we need a mechanism to allow balance_dirty_pages() to communicate to the filesystem that it really is holding too many unstable pages. Currently, all we do is say that 'your total is too big', and then let the filesystem figure out what it needs to do. So how about if we modify your heuristic to do something like this? It applies on top of the previous patch. Cheers Trond --------------------------------------------------------------------------------------------------------- VM/NFS: The VM must tell the filesystem when to free reclaimable pages From: Trond Myklebust balance_dirty_pages() should really tell the filesystem whether or not it has an excess of actual dirty pages, or whether it would be more useful to start freeing up the reclaimable pages. Assume that if the number of dirty pages associated with this backing-dev is less than 1/2 the number of reclaimable pages, then we should concentrate on freeing up the latter. Signed-off-by: Trond Myklebust --- fs/nfs/write.c | 9 +++++++-- include/linux/backing-dev.h | 6 ++++++ mm/page-writeback.c | 7 +++++-- 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 910be28..36113e6 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1420,8 +1420,10 @@ int nfs_commit_unstable_pages(struct address_space *mapping, if (wbc->sync_mode != WB_SYNC_ALL && radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, NFS_PAGE_TAG_LOCKED)) { - mark_inode_unstable_pages(inode); - return 0; + if (wbc->bdi == NULL) + goto out_nocommit; + if (wbc->bdi->dirty_exceeded != BDI_RECLAIMABLE_EXCEEDED) + goto out_nocommit; } if (wbc->nonblocking) flags = 0; @@ -1429,6 +1431,9 @@ int nfs_commit_unstable_pages(struct address_space *mapping, if (ret > 0) ret = 0; return ret; +out_nocommit: + mark_inode_unstable_pages(inode); + return 0; } #else diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index fcbc26a..cd1645e 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -94,6 +94,12 @@ struct backing_dev_info { #endif }; +enum bdi_dirty_exceeded_state { + BDI_NO_DIRTY_EXCESS = 0, + BDI_DIRTY_EXCEEDED, + BDI_RECLAIMABLE_EXCEEDED, +}; + int bdi_init(struct backing_dev_info *bdi); void bdi_destroy(struct backing_dev_info *bdi); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0b19943..0133c8f 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -524,8 +524,11 @@ static void balance_dirty_pages(struct address_space *mapping, (background_thresh + dirty_thresh) / 2) break; - if (!bdi->dirty_exceeded) - bdi->dirty_exceeded = 1; + if (bdi_nr_writeback > bdi_nr_reclaimable / 2) { + if (bdi->dirty_exceeded != BDI_DIRTY_EXCEEDED) + bdi->dirty_exceeded = BDI_DIRTY_EXCEEDED; + } else if (bdi->dirty_exceeded != BDI_RECLAIMABLE_EXCEEDED) + bdi->dirty_exceeded = BDI_RECLAIMABLE_EXCEEDED; /* Note: nr_reclaimable denotes nr_dirty + nr_unstable. * Unstable writes are a feature of certain networked -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/