Subject: Re: [RFC 0/8] Cpuset aware writeback
From: Peter Zijlstra
To: Christoph Lameter
Cc: akpm@osdl.org, Paul Menage, linux-kernel@vger.kernel.org, Nick Piggin, linux-mm@kvack.org, Andi Kleen, Paul Jackson, Dave Chinner
Date: Tue, 16 Jan 2007 08:38:10 +0100
Message-Id: <1168933090.22935.30.camel@twins>
In-Reply-To: <20070116054743.15358.77287.sendpatchset@schroedinger.engr.sgi.com>
References: <20070116054743.15358.77287.sendpatchset@schroedinger.engr.sgi.com>

On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback, since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
>
> Writeback will occur during the LRU scans, but such writeout
> is not effective since we write page by page and not in inode page
> order (as regular writeback does).
>
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
>
> Secondly, we modify the dirty limit calculation to be based
> on the active cpuset.
>
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
>
> After we have the cpuset throttling in place we can then make
> further fixups:
>
> A. We can do inode based writeout from direct reclaim,
>    avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?

> There are a couple of points where some better ideas could be used:
>
> 1. The nodemask expands the inode structure significantly if the
>    architecture allows a high number of nodes. This is only an issue
>    for IA64. For that platform we expand the inode structure by 128
>    bytes (to support 1024 nodes). The last patch attempts to address
>    the issue by using the knowledge about the maximum possible number
>    of nodes, determined at bootup, to shrink the nodemask.

Not the prettiest indeed; no ideas though.

> 2. The calculation of the per-cpuset limits can require looping over
>    a number of nodes, which may bring the performance of
>    get_dirty_limits() near pre-2.6.18 performance (before the
>    introduction of the ZVC counters); this only affects the cpuset
>    based limit calculation. There is no way of keeping these counters
>    per cpuset since cpusets may overlap.

Well, you gain functionality, you lose some runtime; sad, but probably worth it.

Otherwise it all looks good. A few sketches of how I read the individual pieces follow, to check my understanding.
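For the node tracking, I imagine the per-mapping dirty-node map update looks roughly like the below. Completely untested sketch; ->dirty_nodes and the helper name are my guesses at what the patches do, not lifted from them.

/*
 * Record which node a freshly dirtied page lives on, so writeback can
 * later tell whether this mapping has dirty pages inside a given
 * cpuset.  Locking ignored for brevity.
 */
static inline void mapping_set_dirty_node(struct address_space *mapping,
					  struct page *page)
{
	int node = page_to_nid(page);

	/* Avoid dirtying the cacheline when the bit is already set. */
	if (!node_isset(node, mapping->dirty_nodes))
		node_set(node, mapping->dirty_nodes);
}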
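The throttling side, the loop from point 2 that worries me, I read as something like this. Again untested; the counter usage is illustrative only, with NR_UNRECLAIMABLE being the new counter proposed in B.

/*
 * Dirty limit computed over the nodes of a cpuset instead of the
 * whole machine.  This per-node loop is what may hurt
 * get_dirty_limits() performance.
 */
static unsigned long cpuset_dirty_limit(const nodemask_t *mems,
					int dirty_ratio)
{
	unsigned long dirtyable = 0;
	int node;

	for_each_node_mask(node, *mems) {
		dirtyable += node_present_pages(node);
		/* B: pages that can never be written back don't count. */
		dirtyable -= node_page_state(node, NR_UNRECLAIMABLE);
	}

	return dirtyable * dirty_ratio / 100;
}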
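The inode selection, on the other hand, reduces to a nodemask intersection and should be cheap; ->dirty_nodes is again my guessed name.

/*
 * Writeback filter: does this mapping have dirty pages on any node
 * of the given cpuset?
 */
static inline int mapping_dirty_in_mask(struct address_space *mapping,
					const nodemask_t *mems)
{
	return nodes_intersects(mapping->dirty_nodes, *mems);
}

So only the limit calculation loops over nodes, which matches your point 2.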
Acked-by: Peter Zijlstra