Date: Sat, 16 Mar 2013 21:32:11 +1030
From: Jonathan Woithe
To: Raymond Jennings
Cc: Hillf Danton, David Rientjes, Linux-MM, LKML, Jonathan Woithe
Subject: Re: OOM triggered with plenty of memory free
Message-ID: <20130316110211.GA30445@marvin.atrad.com.au>

On Sat, Mar 16, 2013 at 02:33:23AM -0700, Raymond Jennings wrote:
> On Sat, Mar 16, 2013 at 2:25 AM, Hillf Danton wrote:
> >> Some system specifications:
> >> - CPU: i7 860 at 2.8 GHz
> >> - Mainboard: Advantech AIMB-780
> >> - RAM: 4 GB
> >> - Kernel: 2.6.35.11 SMP, 32 bit (kernel.org kernel, no patches applied)
> >
> > The highmem no longer holds memory with 64-bit kernel.
>
> I don't really think that's a valid reason to dismiss problems with
> 32-bit though, as I still use it myself.
>
> Anyway, to the parent poster, could you tell us more, such as how much
> ram you had left free?
>
> A printout of /proc/meminfo might help here.

Sure.
Here are the contents of /proc/meminfo as they were just before the machine
was rebooted:

  MemTotal:        3048988 kB
  MemFree:         1930548 kB
  Buffers:               0 kB
  Cached:            56876 kB
  SwapCached:            0 kB
  Active:            78016 kB
  Inactive:          53500 kB
  Active(anon):      57220 kB
  Inactive(anon):    22888 kB
  Active(file):      20796 kB
  Inactive(file):    30612 kB
  Unevictable:      127172 kB
  Mlocked:          127172 kB
  HighTotal:       2194952 kB
  HighFree:        1923040 kB
  LowTotal:         854036 kB
  LowFree:            7508 kB
  SwapTotal:       8393924 kB
  SwapFree:        8393924 kB
  Dirty:                52 kB
  Writeback:           684 kB
  AnonPages:        202204 kB
  Mapped:            25208 kB
  Shmem:              2600 kB
  Slab:             818868 kB
  SReclaimable:       6240 kB
  SUnreclaim:       812628 kB
  KernelStack:        2608 kB
  PageTables:         1388 kB
  NFS_Unstable:          0 kB
  Bounce:                0 kB
  WritebackTmp:          0 kB
  CommitLimit:     9918416 kB
  Committed_AS:     433632 kB
  VmallocTotal:     122880 kB
  VmallocUsed:       24952 kB
  VmallocChunk:      56908 kB
  DirectMap4k:       16376 kB
  DirectMap4M:      892928 kB

"free" reported this:

               total       used       free     shared    buffers     cached
  Mem:       3048988    1101120    1947868          0          0      48780
  -/+ buffers/cache:    1052340    1996648
  Swap:      8393924          0    8393924

Earlier posts in this thread (at that point only in linux-mm) concentrated
on the /proc/slabinfo output, which was retrieved from a system similar to
the faulting one (with only about 100 days uptime it was not yet OOMing).
This included the following:

  kmalloc-128  1234556 1235168  128   32  1 : tunables 0 0 0 : slabdata  38599  38599  0
  kmalloc-64   1238117 1238144   64   64  1 : tunables 0 0 0 : slabdata  19346  19346  0
  kmalloc-32   1236600 1236608   32  128  1 : tunables 0 0 0 : slabdata   9661   9661  0

which pointed to a kernel memory leak.  This was subsequently confirmed
using kmemleak, which threw many detections similar to the following
example:

  unreferenced object 0xf5d3b500 (size 128):
    comm "udevd", pid 1382, jiffies 4294676664 (age 15504.596s)
    hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [] kmemleak_alloc+0x2c/0x60
      [] kmem_cache_alloc+0xaa/0x130
      [] prepare_creds+0x2c/0xb0
      [] copy_creds+0x7d/0x1f0
      [] copy_process+0x23a/0xe00
      [] do_fork+0x83/0x3a0
      [] sys_clone+0x34/0x40
      [] ptregs_clone+0x15/0x1c
      [] 0xffffffff

Dave Hansen then noted:

> Your kmemleak data shows that the leaks are always from either 'struct
> cred', or 'struct pid'.  Those are _generally_ tied to tasks, but you
> only have a couple thousand task_structs.
>
> My suspicion would be that something is allocating those structures, but
> a refcount got leaked somewhere.

For details refer to past posts in this thread to linux-mm.

At this point I was able to test 3.7.9 (the latest stable available then)
and the above leak did not appear to be occurring.  3.4.x and 3.0.x are also
ok, so it seems that somewhere between 2.6.35.11 and 3.0 it went away.

[An aside: unfortunately 3.7.9 has an unrelated bug in the network card
driver we're using (introduced in 3.3) which hits us in other ways.  A git
bisect has isolated the offending commit, but until that's fixed we can't
move to anything newer than a 3.2 kernel.]

Since it's relatively easy to tell whether the memory leak is present using
kmemleak and I now have access to some off-line hardware to permit testing,
I am thinking of running a git bisect to see if I can identify which commit
fixed the leak.  Even if it turns out to be of academic interest only, it
would be good to know that it was fixed rather than somehow being avoided
for the moment due to another change.

Let me know if there's anything more I could do.

Regards
  jonathan
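
[Editorial sketch, not part of the original message.]  The kmemleak check
described above could be turned into a repeatable good/bad verdict for each
kernel tested during a bisect.  The following is a minimal sketch in Python,
assuming the kmemleak report has the same layout as the example quoted in
this message.  The function names, the default symbol "prepare_creds", and
the threshold of one matching record are all assumptions made for
illustration; they are not taken from the thread.

```python
from typing import List

# Standard debugfs location of the kmemleak report (debugfs must be mounted).
KMEMLEAK_PATH = "/sys/kernel/debug/kmemleak"


def split_records(report: str) -> List[List[str]]:
    """Split a kmemleak report into records, one per 'unreferenced object'."""
    records: List[List[str]] = []
    for line in report.splitlines():
        text = line.strip()
        if text.startswith("unreferenced object "):
            records.append([])  # a new leak record starts here
        if records and text:
            records[-1].append(text)
    return records


def count_leaks(report: str, symbol: str) -> int:
    """Count records whose text contains a backtrace entry for `symbol`.

    Backtrace entries look like 'prepare_creds+0x2c/0xb0', so a token must
    start with the exact symbol name followed by '+0x' to count as a match.
    """
    prefix = symbol + "+0x"
    return sum(
        1
        for record in split_records(report)
        if any(tok.startswith(prefix) for line in record for tok in line.split())
    )


def verdict(report: str, symbol: str = "prepare_creds", threshold: int = 1) -> str:
    """Return 'leaky' or 'clean'.  The threshold is an arbitrary assumption;
    a real test would likely need a higher value to ignore false positives."""
    return "leaky" if count_leaks(report, symbol) >= threshold else "clean"


def read_report(path: str = KMEMLEAK_PATH) -> str:
    """Read the current kmemleak report (requires root on a kmemleak kernel)."""
    with open(path, "r") as handle:
        return handle.read()
```

On a test kernel one would typically trigger a scan first (writing "scan" to
/sys/kernel/debug/kmemleak) and then call verdict(read_report()).  Because
the aim here is to find the commit that fixed the leak rather than the one
that introduced it, the usual bisect sense is inverted: kernels reported as
"leaky" are the old state and "clean" kernels are the new one.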