Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756184Ab2HAWfk (ORCPT ); Wed, 1 Aug 2012 18:35:40 -0400 Received: from earth.cora.nwra.com ([4.28.99.180]:59422 "EHLO mail.cora.nwra.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1756075Ab2HAWfi (ORCPT ); Wed, 1 Aug 2012 18:35:38 -0400 X-Greylist: delayed 626 seconds by postgrey-1.27 at vger.kernel.org; Wed, 01 Aug 2012 18:35:38 EDT Message-ID: <5019ACC6.1040501@cora.nwra.com> Date: Wed, 01 Aug 2012 16:25:10 -0600 From: Orion Poplawski User-Agent: Mozilla/5.0 (X11; Linux i686; rv:13.0) Gecko/20120615 Thunderbird/13.0.1 MIME-Version: 1.0 To: linux-kernel@vger.kernel.org Subject: Need help debugging crazy kernel memory issue Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 10963 Lines: 248 I recently started experiencing crashes (every 1-2 days) on one of my ScientificLinux 6.2 boxes. It appears that the machine runs out of memory, but the memory report makes no sense. I'm seeing it with each of the following kernels: kernel-2.6.32-220.17.1.el6.x86_64 kernel-2.6.32-220.23.1.el6.x86_64 kernel-2.6.32-279.1.1.el6.x86_64 kernel-2.6.32-220.17.1.el6.x86_64 had run fine for a long time, but reverting to it has not solved the problem. I have tried to revert all other updates from around the time the problems started to no avail. It seems like everything about the memory config goes screwy. Here's an info dump from normal operation (48GB RAM, 8GB swap): SysRq : Show Memory Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 usd: 0 CPU 2: hi: 0, btch: 1 usd: 0 CPU 3: hi: 0, btch: 1 usd: 0 CPU 4: hi: 0, btch: 1 usd: 0 CPU 5: hi: 0, btch: 1 usd: 0 CPU 6: hi: 0, btch: 1 usd: 0 CPU 7: hi: 0, btch: 1 usd: 0 CPU 8: hi: 0, btch: 1 usd: 0 CPU 9: hi: 0, btch: 1 usd: 0 CPU 10: hi: 0, btch: 1 usd: 0 CPU 11: hi: 0, btch: 1 usd: 0 CPU 12: hi: 0, btch: 1 usd: 0 CPU 13: hi: 0, btch: 1 usd: 0 CPU 14: hi: 0, btch: 1 usd: 0 CPU 15: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 180 CPU 1: hi: 186, btch: 31 usd: 158 CPU 2: hi: 186, btch: 31 usd: 0 CPU 3: hi: 186, btch: 31 usd: 0 CPU 4: hi: 186, btch: 31 usd: 0 CPU 5: hi: 186, btch: 31 usd: 0 CPU 6: hi: 186, btch: 31 usd: 0 CPU 7: hi: 186, btch: 31 usd: 0 CPU 8: hi: 186, btch: 31 usd: 0 CPU 9: hi: 186, btch: 31 usd: 19 CPU 10: hi: 186, btch: 31 usd: 30 CPU 11: hi: 186, btch: 31 usd: 0 CPU 12: hi: 186, btch: 31 usd: 0 CPU 13: hi: 186, btch: 31 usd: 0 CPU 14: hi: 186, btch: 31 usd: 0 CPU 15: hi: 186, btch: 31 usd: 0 Node 0 Normal per-cpu: CPU 0: hi: 186, btch: 31 usd: 144 CPU 1: hi: 186, btch: 31 usd: 124 CPU 2: hi: 186, btch: 31 usd: 182 CPU 3: hi: 186, btch: 31 usd: 162 CPU 4: hi: 186, btch: 31 usd: 182 CPU 5: hi: 186, btch: 31 usd: 162 CPU 6: hi: 186, btch: 31 usd: 0 CPU 7: hi: 186, btch: 31 usd: 0 CPU 8: hi: 186, btch: 31 usd: 138 CPU 9: hi: 186, btch: 31 usd: 119 CPU 10: hi: 186, btch: 31 usd: 88 CPU 11: hi: 186, btch: 31 usd: 75 CPU 12: hi: 186, btch: 31 usd: 0 CPU 13: hi: 186, btch: 31 usd: 183 CPU 14: hi: 186, btch: 31 usd: 179 CPU 15: hi: 186, btch: 31 usd: 159 Node 1 Normal per-cpu: CPU 0: hi: 186, btch: 31 usd: 184 CPU 1: hi: 186, btch: 31 usd: 168 CPU 2: hi: 186, btch: 31 usd: 164 CPU 3: hi: 186, btch: 31 usd: 0 CPU 4: hi: 186, btch: 31 usd: 41 CPU 5: hi: 186, btch: 31 usd: 172 CPU 6: hi: 186, btch: 31 usd: 126 CPU 7: hi: 186, btch: 31 usd: 145 CPU 8: hi: 186, btch: 31 usd: 63 CPU 9: hi: 186, btch: 31 usd: 158 CPU 10: hi: 186, btch: 31 usd: 162 CPU 11: hi: 186, btch: 31 usd: 165 CPU 12: hi: 186, btch: 31 usd: 48 CPU 13: hi: 186, btch: 31 usd: 64 CPU 14: hi: 186, btch: 31 usd: 66 CPU 15: hi: 186, btch: 31 usd: 165 active_anon:1356095 inactive_anon:839 isolated_anon:0 active_file:93487 inactive_file:384620 isolated_file:0 unevictable:0 dirty:65 writeback:0 unstable:0 free:10288450 slab_reclaimable:40068 slab_unreclaimable:34613 mapped:9236 shmem:474 pagetables:6530 bounce:0 Node 0 DMA free:15396kB min:36kB low:44kB high:52kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14984kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 2991 24201 24201 Node 0 DMA32 free:2563944kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3063392kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 21210 21210 Node 0 Normal free:16895636kB min:57372kB low:71712kB high:86056kB active_anon:2986668kB inactive_anon:92kB active_file:278460kB inactive_file:1316128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:21719040kB mlocked:0kB dirty:120kB writeback:0kB mapped:18948kB shmem:360kB slab_reclaimable:123620kB slab_unreclaimable:95556kB kernel_stack:7024kB pagetables:12872kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:21678824kB min:65568kB low:81960kB high:98352kB active_anon:2437712kB inactive_anon:3264kB active_file:95488kB inactive_file:222352kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:24821760kB mlocked:0kB dirty:140kB writeback:0kB mapped:17996kB shmem:1536kB slab_reclaimable:36652kB slab_unreclaimable:42896kB kernel_stack:696kB pagetables:13248kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB 2*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15396kB Node 0 DMA32: 6*4kB 12*8kB 3*16kB 6*32kB 8*64kB 16*128kB 6*256kB 7*512kB 8*1024kB 6*2048kB 619*4096kB = 2563944kB Node 0 Normal: 1832*4kB 1354*8kB 846*16kB 583*32kB 272*64kB 65*128kB 70*256kB 53*512kB 37*1024kB 2*2048kB 4085*4096kB = 16895280kB Node 1 Normal: 509*4kB 2146*8kB 1166*16kB 831*32kB 346*64kB 197*128kB 200*256kB 202*512kB 192*1024kB 131*2048kB 5114*4096kB = 21678276kB 478555 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 8388600kB Total swap = 8388600kB 12582896 pages RAM 227482 pages reserved 450152 pages shared 1712399 pages non-shared and here's the initial oom and Mem-Info message: lvm invoked oom-killer: gfp_mask=0x201d0, order=0, oom_adj=0, oom_score_adj=0 lvm cpuset=/ mems_allowed=0 Pid: 3405, comm: lvm Not tainted 2.6.32-279.1.1.el6.x86_64 #1 Call Trace: [] ? cpuset_print_task_mems_allowed+0x91/0xb0 [] ? dump_header+0x90/0x1b0 [] ? security_real_capable_noaudit+0x3c/0x70 [] ? oom_kill_process+0x82/0x2a0 [] ? select_bad_process+0xe1/0x120 [] ? out_of_memory+0x220/0x3c0 [] ? blkdev_get_block+0x0/0x70 [] ? __alloc_pages_nodemask+0x89e/0x940 [] ? alloc_pages_current+0xaa/0x110 [] ? __page_cache_alloc+0x87/0x90 [] ? find_get_page+0x1e/0xa0 [] ? do_read_cache_page+0x4b/0x180 [] ? blkdev_readpage+0x0/0x20 [] ? read_cache_page_async+0x19/0x20 [] ? read_cache_page+0xe/0x20 [] ? read_dev_sector+0x30/0xa0 [] ? amiga_partition+0x6d/0x460 [] ? read_cache_page_async+0x19/0x20 [] ? read_dev_sector+0x30/0xa0 [] ? osf_partition+0x6c/0x120 [] ? rescan_partitions+0x1a7/0x470 [] ? __blkdev_get+0x1b6/0x3c0 [] ? blkdev_open+0x0/0xc0 [] ? blkdev_get+0x10/0x20 [] ? blkdev_open+0x71/0xc0 [] ? __dentry_open+0x10a/0x360 [] ? selinux_inode_permission+0x72/0xb0 [] ? security_inode_permission+0x1f/0x30 [] ? nameidata_to_filp+0x54/0x70 [] ? do_filp_open+0x6c0/0xd60 [] ? alloc_fd+0x92/0x160 [] ? do_sys_open+0x69/0x140 [] ? sys_open+0x20/0x30 [] ? system_call_fastpath+0x16/0x1b Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 42, btch: 7 usd: 23 active_anon:49 inactive_anon:97 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:3846 dirty:0 writeback:0 unstable:0 free:412 slab_reclaimable:1194 slab_unreclaimable:5681 mapped:356 shmem:0 pagetables:31 bounce:0 Node 0 DMA free:224kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:328kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 125 125 125 Node 0 DMA32 free:1424kB min:1428kB low:1784kB high:2140kB active_anon:196kB inactive_anon:388kB active_file:0kB inactive_file:0kB unevictable:15384kB isolated(anon):0kB isolated(file):0kB present:128256kB mlocked:0kB dirty:0kB writeback:0kB mapped:1424kB shmem:0kB slab_reclaimable:4776kB slab_unreclaimable:22724kB kernel_stack:600kB pagetables:124kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 0*4kB 2*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 224kB Node 0 DMA32: 0*4kB 2*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1424kB45035 3846 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 45035 pages RAM 16585 pages reserved 359 pages shared 23771 pages non-shared Only 1 cpu listed, memory numbers are incredibly small (180MB of RAM?! - no wonder it is out of memory), no swap, no Normal nodes listed, etc. It's fairly consistent, I see the same # of pages RAM each time it crashes. I ran memcheck through test #4 with no errors. cmdline: ro root=/dev/mapper/vg_root-root rd_MD_UUID=486b6486:65829f41:e3ccc1e2:ace1579a rd_LVM_LV=vg_root/root rd_LVM_LV=vg_root/swap rd_NO_LUKS rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=512M-2G:64M,2G-:128M console=tty0 console=ttyS0,115200 Any ideas? I'm at a loss. -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA, Boulder Office FAX: 303-415-9702 3380 Mitchell Lane orion@nwra.com Boulder, CO 80301 http://www.nwra.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/