Date: Mon, 4 Apr 2005 13:33:33 -0700
From: Andrew Morton
To: "Martin J. Bligh"
Cc: cmm@us.ibm.com, andrea@suse.de, linux-kernel@vger.kernel.org,
	ext2-devel@lists.sourceforge.net, sct@redhat.com, janetinc@us.ibm.com
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests
Message-Id: <20050404133333.79fd9d93.akpm@osdl.org>
In-Reply-To: <37420000.1112646263@flay>
References: <20050315204413.GF20253@csail.mit.edu>
	<20050316003134.GY7699@opteron.random>
	<20050316040435.39533675.akpm@osdl.org>
	<20050316183701.GB21597@opteron.random>
	<1111607584.5786.55.camel@localhost.localdomain>
	<20050403183544.7c31f85c.akpm@osdl.org>
	<1112633417.3703.8.camel@dyn318043bld.beaverton.ibm.com>
	<20050404130441.53ab480b.akpm@osdl.org>
	<37420000.1112646263@flay>

"Martin J. Bligh" wrote:
>
> >> > > I ran into the OOM problem again on 2.6.12-rc1. I ran some (20) fsx
> >> > > tests on the 2.6.12-rc1 kernel (and 2.6.11-mm4) on an ext3
> >> > > filesystem; after about 10 hours the system hit OOM, and OOM kept
> >> > > killing processes one by one. I can reproduce this problem very
> >> > > consistently on a 2-way PIII 700MHz with 512MB RAM. The problem
> >> > > can also be reproduced running the same test on reiserfs.
> >> > >
> >> > > The fsx command is:
> >> > >
> >> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >> >
> >> > This ext3 bug goes all the way back to 2.6.6.
> >> >
> >> > I don't know yet why you saw problems with reiser3, and I'm pretty sure I
> >> > saw problems with ext2. More testing is needed there.
> >>
> >> We (Janet and I) are chasing this bug as well. Janet is able to
> >> reproduce this bug on 2.6.9 but I can't. Glad to know you have nailed
> >> down this issue on ext3. I am pretty sure I saw this on Reiser3 once; I
> >> will double-check it. Will try your patch today.
> >
> > There's a second leak, with similar-looking symptoms. At ~50
> > commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
> > than a hundredth the rate of the first one.
>
> What are you using to see these with, just kgdb, and a large cranial
> capacity? Or is there some more magic?

Nothing magical: run the test for a while, kill everything, cause a huge
swapstorm, then look at the meminfo numbers. If active+inactive is
significantly larger than cached+buffers+swapcached+mapped+minus-a-bit
then it's leaked.
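Something like the below (a rough, untested sketch) does that arithmetic
straight from /proc/meminfo; the 4MB of "minus-a-bit" slack is an arbitrary
choice, and the field names are the 2.6-era ones:

/* leakcheck.c: rough, untested sketch of the meminfo check described above */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the value (in kB) of a /proc/meminfo field, or 0 if absent. */
static long meminfo(const char *field)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[128];
	size_t len = strlen(field);
	long val = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, field, len) && line[len] == ':') {
			val = atol(line + len + 1);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long lru = meminfo("Active") + meminfo("Inactive");
	long accounted = meminfo("Buffers") + meminfo("Cached") +
			 meminfo("SwapCached") + meminfo("Mapped");
	long slack = 4 * 1024;		/* "minus-a-bit": arbitrary 4MB */

	printf("Active+Inactive:                  %ld kB\n", lru);
	printf("Buffers+Cached+SwapCached+Mapped: %ld kB\n", accounted);
	if (lru > accounted + slack)
		printf("Looks leaky: ~%ld kB unaccounted for\n",
		       lru - accounted);
	return 0;
}

Run it after the swapstorm, once everything that can be reclaimed has been
pushed out; before that, ordinary pagecache dominates Active+Inactive anyway.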
Right now I have:

MemTotal:         246264 kB
MemFree:          196148 kB
Buffers:            4200 kB
Cached:             3308 kB
SwapCached:         8064 kB
Active:            21548 kB
Inactive:          12532 kB
HighTotal:             0 kB
HighFree:              0 kB
LowTotal:         246264 kB
LowFree:          196148 kB
SwapTotal:       1020116 kB
SwapFree:        1001612 kB
Dirty:                60 kB
Writeback:             0 kB
Mapped:             2284 kB
Slab:              12200 kB
CommitLimit:     1143248 kB
Committed_AS:      34004 kB
PageTables:         1200 kB
VmallocTotal:     774136 kB
VmallocUsed:       82832 kB
VmallocChunk:     691188 kB
HugePages_Total:       0
HugePages_Free:        0

33 megs on the LRU, unaccounted for in other places.

Once the leak is nice and large I can start a new swapstorm, set a
breakpoint in try_to_free_buffers() (for example) and start looking at the
state of the page and its buffers.
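For reference, plugging those numbers into the check above: Active+Inactive
= 21548 + 12532 = 34080 kB (the ~33MB on the LRU), while
Buffers+Cached+SwapCached+Mapped comes to only 4200 + 3308 + 8064 + 2284 =
17856 kB, so the heuristic trips by roughly 16MB even before allowing for
any slack.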