2011-04-01 20:17:12

by haael

Subject: Bug: Apparent memory exhaustion WTF?


Hello, guys. My server keeps hanging with the error output shown at the
bottom of this message. I run an unpatched vanilla kernel. A full system
upgrade didn't help, and neither did upgrading the kernel from 2.6.28.7 to
2.6.35.10. I also tried turning swap on and off, upgrading the memory and
adding a new CPU. After the CPU upgrade things actually got worse; now the
system blows up every few hours.

Here you see the squid process running out of memory, but nothing changed
when I killed squid; some other process would always trigger the same error.
Just before a hangup, all processes get killed, as if the OOM killer had gone
wild. But the OOM killer shouldn't take down the network stack, right?

After a hangup the system becomes completely unresponsive: it doesn't answer
pings or even ARP requests. The only thing that still works is the serial
console, on which I get the following error, printed continuously, about once
per second, forever. The only thing I can do with the server is cut the
power.

Tell me, guys, what should I do? The server is an HP ProLiant with two
dual-core Intel CPUs and 16GB of RAM. It has a RAID-1 disk and a Broadcom
network adapter, which is my prime suspect. Attached: lspci, /proc/meminfo,
/proc/cpuinfo, the kernel config and the actual error message from the serial
console.


[83155.708165] squid: page allocation failure. order:0, mode:0x4020
[83155.718040] Pid: 19999, comm: squid Not tainted 2.6.35.10-server #1
[83155.718040] Call Trace:
[83155.718040] [<c0175d67>] ? 0xc0175d67
[83155.718040] [<c019acdf>] ? 0xc019acdf
[83155.718040] [<c019b275>] ? 0xc019b275
[83155.718040] [<c030fe0b>] ? 0xc030fe0b
[83155.718040] [<c030fe0b>] ? 0xc030fe0b
[83155.718040] [<c030f838>] ? 0xc030f838
[83155.718040] [<c030fe0b>] ? 0xc030fe0b
[83155.718040] [<f81444f5>] ? 0xf81444f5
[83155.718040] [<f8142858>] ? 0xf8142858
[83155.718040] [<c0317a6a>] ? 0xc0317a6a
[83155.718040] [<c0137e77>] ? 0xc0137e77
[83155.718040] [<c0137df0>] ? 0xc0137df0
[83155.718040] <IRQ> [<c0137c75>] ? 0xc0137c75
[83155.718040] [<c0119ec3>] ? 0xc0119ec3
[83155.718040] [<c0296760>] ? 0xc0296760
[83155.718040] [<c039480a>] ? 0xc039480a
[83155.718040] [<c01336e5>] ? 0xc01336e5
[83155.718040] [<c01030a9>] ? 0xc01030a9
[83155.718040] [<c0392135>] ? 0xc0392135
[83155.718040] [<c024495a>] ? 0xc024495a
[83155.718040] [<c0172e93>] ? 0xc0172e93
[83155.718040] [<c0243eea>] ? 0xc0243eea
[83155.718040] [<c0172d58>] ? 0xc0172d58
[83155.718040] [<c0172ff3>] ? 0xc0172ff3
[83155.718040] [<c01731b3>] ? 0xc01731b3
[83155.718040] [<c0173274>] ? 0xc0173274
[83155.718040] [<c0175e99>] ? 0xc0175e99
[83155.718040] [<c0175ec4>] ? 0xc0175ec4
[83155.718040] [<c01af563>] ? 0xc01af563
[83155.718040] [<c0345992>] ? 0xc0345992
[83155.718040] [<c030805c>] ? 0xc030805c
[83155.718040] [<c01af7d7>] ? 0xc01af7d7
[83155.718040] [<c01af4c0>] ? 0xc01af4c0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01af5b0>] ? 0xc01af5b0
[83155.718040] [<c01aee68>] ? 0xc01aee68
[83155.718040] [<c01afbaa>] ? 0xc01afbaa
[83155.718040] [<c039441d>] ? 0xc039441d
[83155.718040] [<c0390000>] ? 0xc0390000
[83155.718040] Mem-Info:
[83155.718040] DMA per-cpu:
[83155.718040] CPU 0: hi: 0, btch: 1 usd: 0
[83155.718040] CPU 1: hi: 0, btch: 1 usd: 0
[83155.718040] CPU 2: hi: 0, btch: 1 usd: 0
[83155.718040] CPU 3: hi: 0, btch: 1 usd: 0
[83155.718040] Normal per-cpu:
[83155.718040] CPU 0: hi: 186, btch: 31 usd: 51
[83155.718040] CPU 1: hi: 186, btch: 31 usd: 114
[83155.718040] CPU 2: hi: 186, btch: 31 usd: 57
[83155.718040] CPU 3: hi: 186, btch: 31 usd: 150
[83155.718040] HighMem per-cpu:
[83155.718040] CPU 0: hi: 186, btch: 31 usd: 21
[83155.718040] CPU 1: hi: 186, btch: 31 usd: 0
[83155.718040] CPU 2: hi: 186, btch: 31 usd: 162
[83155.718040] CPU 3: hi: 186, btch: 31 usd: 0
[83155.718040] active_anon:110770 inactive_anon:10570 isolated_anon:0
[83155.718040] active_file:9886 inactive_file:29412 isolated_file:0
[83155.718040] unevictable:0 dirty:0 writeback:0 unstable:0
[83155.718040] free:3806991 slab_reclaimable:2051 slab_unreclaimable:184327
[83155.718040] mapped:3786 shmem:21 pagetables:503 bounce:0
[83155.718040] DMA free:3472kB min:64kB low:80kB high:96kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15848kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:8636kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[83155.718040] lowmem_reserve[]: 0 863 16233 16233
[83155.718040] Normal free:1336kB min:3724kB low:4652kB high:5584kB
active_anon:0kB inactive_anon:0kB active_file:1340kB inactive_file:1396kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:883912kB
mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB
slab_reclaimable:8204kB slab_unreclaimable:728672kB kernel_stack:472kB
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
[83155.718040] lowmem_reserve[]: 0 0 122959 122959
[83155.718040] HighMem free:15223156kB min:512kB low:17092kB high:33676kB
active_anon:443080kB inactive_anon:42280kB active_file:38204kB
inactive_file:116252kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15738868kB mlocked:0kB dirty:0kB writeback:0kB mapped:15140kB
shmem:84kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:2012kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
[83155.718040] lowmem_reserve[]: 0 0 0 0
[83155.718040] DMA: 48*4kB 2*8kB 0*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 3472kB
[83155.718040] Normal: 178*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 0*256kB
1*512kB 0*1024kB 0*2048kB 0*4096kB = 1336kB
[83155.718040] HighMem: 30030*4kB 16829*8kB 11995*16kB 8071*32kB 3885*64kB
1150*128kB 230*256kB 50*512kB 17*1024kB 6*2048kB 3420*4096kB = 15223280kB
[83155.718040] 39319 total pagecache pages
[83155.718040] 0 pages in swap cache
[83155.718040] Swap cache stats: add 0, delete 0, find 0/0
[83155.718040] Free swap = 0kB
[83155.718040] Total swap = 0kB
[83155.718040] 4390911 pages RAM
[83155.718040] 4164098 pages HighMem
[83155.718040] 232489 pages reserved
[83155.718040] 45783 pages shared
[83155.718040] 314665 pages non-shared
[83155.718040] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[83155.718040] cache: kmalloc-512, object size: 512, buffer size: 512,
default order: 1, min order: 0
[83155.718040] node 0: slabs: 147, objs: 2184, free: 0


Attachments:
config.txt (53.78 kB)
cpuinfo.txt (3.00 kB)
lspci.txt (59.41 kB)
meminfo.txt (1.09 kB)
error.txt (53.67 kB)

2011-04-01 20:22:28

by Randy Dunlap

Subject: Re: Bug: Apparent memory exhaustion WTF?

On Fri, 01 Apr 2011 22:17:01 +0200 haael wrote:

>
> Hello, guys. My server keeps hanging with the error output shown at the
> bottom of this message. I run an unpatched vanilla kernel. A full system
> upgrade didn't help, and neither did upgrading the kernel from 2.6.28.7 to
> 2.6.35.10. I also tried turning swap on and off, upgrading the memory and
> adding a new CPU. After the CPU upgrade things actually got worse; now the
> system blows up every few hours.
>
> Here you see the squid process running out of memory, but nothing changed
> when I killed squid; some other process would always trigger the same
> error. Just before a hangup, all processes get killed, as if the OOM
> killer had gone wild. But the OOM killer shouldn't take down the network
> stack, right?
>
> After a hangup the system becomes completely unresponsive: it doesn't
> answer pings or even ARP requests. The only thing that still works is the
> serial console, on which I get the following error, printed continuously,
> about once per second, forever. The only thing I can do with the server is
> cut the power.
>
> Tell me, guys, what should I do? The server is an HP ProLiant with two
> dual-core Intel CPUs and 16GB of RAM. It has a RAID-1 disk and a Broadcom
> network adapter, which is my prime suspect. Attached: lspci, /proc/meminfo,
> /proc/cpuinfo, the kernel config and the actual error message from the
> serial console.

It would really help if you built with KALLSYMS enabled (=y), so that the
stack trace below would actually be meaningful.

# CONFIG_KALLSYMS is not set
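
One way to flip it on, as a rough sketch (assuming you rebuild from the same
source tree; scripts/config ships with the kernel):

  scripts/config --enable KALLSYMS
  make oldconfig
  make && make modules_install install

It's also reachable from make menuconfig, under "General setup".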


> [83155.708165] squid: page allocation failure. order:0, mode:0x4020
> [83155.718040] Pid: 19999, comm: squid Not tainted 2.6.35.10-server #1
> [83155.718040] Call Trace:
> [83155.718040] [<c0175d67>] ? 0xc0175d67
> [83155.718040] [<c019acdf>] ? 0xc019acdf
> [83155.718040] [<c019b275>] ? 0xc019b275
> [83155.718040] [<c030fe0b>] ? 0xc030fe0b
> [83155.718040] [<c030fe0b>] ? 0xc030fe0b
> [83155.718040] [<c030f838>] ? 0xc030f838
> [83155.718040] [<c030fe0b>] ? 0xc030fe0b
> [83155.718040] [<f81444f5>] ? 0xf81444f5
> [83155.718040] [<f8142858>] ? 0xf8142858
> [83155.718040] [<c0317a6a>] ? 0xc0317a6a
> [83155.718040] [<c0137e77>] ? 0xc0137e77
> [83155.718040] [<c0137df0>] ? 0xc0137df0
> [83155.718040] <IRQ> [<c0137c75>] ? 0xc0137c75
> [83155.718040] [<c0119ec3>] ? 0xc0119ec3
> [83155.718040] [<c0296760>] ? 0xc0296760
> [83155.718040] [<c039480a>] ? 0xc039480a
> [83155.718040] [<c01336e5>] ? 0xc01336e5
> [83155.718040] [<c01030a9>] ? 0xc01030a9
> [83155.718040] [<c0392135>] ? 0xc0392135
> [83155.718040] [<c024495a>] ? 0xc024495a
> [83155.718040] [<c0172e93>] ? 0xc0172e93
> [83155.718040] [<c0243eea>] ? 0xc0243eea
> [83155.718040] [<c0172d58>] ? 0xc0172d58
> [83155.718040] [<c0172ff3>] ? 0xc0172ff3
> [83155.718040] [<c01731b3>] ? 0xc01731b3
> [83155.718040] [<c0173274>] ? 0xc0173274
> [83155.718040] [<c0175e99>] ? 0xc0175e99
> [83155.718040] [<c0175ec4>] ? 0xc0175ec4
> [83155.718040] [<c01af563>] ? 0xc01af563
> [83155.718040] [<c0345992>] ? 0xc0345992
> [83155.718040] [<c030805c>] ? 0xc030805c
> [83155.718040] [<c01af7d7>] ? 0xc01af7d7
> [83155.718040] [<c01af4c0>] ? 0xc01af4c0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01af5b0>] ? 0xc01af5b0
> [83155.718040] [<c01aee68>] ? 0xc01aee68
> [83155.718040] [<c01afbaa>] ? 0xc01afbaa
> [83155.718040] [<c039441d>] ? 0xc039441d
> [83155.718040] [<c0390000>] ? 0xc0390000


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2011-04-02 15:20:59

by Phil Turmel

Subject: Re: Bug: Apparent memory exhaustion WTF?

On 04/01/2011 04:17 PM, haael wrote:
>
> Hello, guys. My server keeps hanging with the error output shown at the bottom of this message. I run an unpatched vanilla kernel. A full system upgrade didn't help, and neither did upgrading the kernel from 2.6.28.7 to 2.6.35.10. I also tried turning swap on and off, upgrading the memory and adding a new CPU. After the CPU upgrade things actually got worse; now the system blows up every few hours.
>
> Here you see the squid process running out of memory, but nothing changed when I killed squid; some other process would always trigger the same error. Just before a hangup, all processes get killed, as if the OOM killer had gone wild. But the OOM killer shouldn't take down the network stack, right?
>
> After a hangup the system becomes completely unresponsive: it doesn't answer pings or even ARP requests. The only thing that still works is the serial console, on which I get the following error, printed continuously, about once per second, forever. The only thing I can do with the server is cut the power.
>
> Tell me, guys, what should I do? The server is an HP ProLiant with two dual-core Intel CPUs and 16GB of RAM. It has a RAID-1 disk and a Broadcom network adapter, which is my prime suspect. Attached: lspci, /proc/meminfo, /proc/cpuinfo, the kernel config and the actual error message from the serial console.

I think this is part of your problem:

> # CONFIG_64BIT is not set
> CONFIG_X86_32=y
> # CONFIG_X86_64 is not set
> CONFIG_X86=y

Everything I've seen on these lists says 32-bit kernels and large amounts of
RAM are a bad combination. As I understand it, everything the kernel uses to
track memory -- the struct page array for all 16GB, plus every slab
allocation -- has to fit into the permanently-mapped low memory, which with
the default 3G/1G kernel/user split is only about 896MB.
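
Rough numbers, as a back-of-the-envelope sketch (assuming 4kB pages, a
struct page of roughly 32 bytes on i386, and ~896MB of lowmem):

  16GB / 4kB            = 4,194,304 pages to track
  4,194,304 * ~32 bytes = ~128MB of mem_map, pinned in low memory
  lowmem                = ~896MB (your log shows Normal present:883912kB)

So the bookkeeping alone eats about a seventh of lowmem before slab, kernel
stacks and DMA buffers get a byte -- and your log shows slab_unreclaimable
at 728672kB, so most of the Normal zone is unreclaimable slab. No wonder the
order:0 atomic allocation failed: Normal free:1336kB is below min:3724kB.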

> [83155.718040] Free swap = 0kB
> [83155.718040] Total swap = 0kB

And there's no swap to fall back on, which might be OK on a 64-bit kernel.
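
If those CPUs are 64-bit capable (look for the "lm" flag in the cpuinfo you
attached), the obvious fix is a rebuild with CONFIG_64BIT=y. You don't need
to touch userspace: with CONFIG_IA32_EMULATION=y a 64-bit kernel will keep
running your existing 32-bit binaries.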

Just my $0.02.

Phil