Why does copying an 18GB file on a 74GB Raptor RAID1 cause the kernel to
invoke the OOM killer and kill all of my processes?
Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
happens every time!
Anything to try? Any other output needed? Can someone shed some light on
this situation?
Thanks.
The last lines of vmstat 1 (right before it kill -9'd my shell/ssh)
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 7 764 50348 12 1269988 0 0 53632 172 1902 4600 1 8 29 62
0 7 764 49420 12 1260004 0 0 53632 34368 1871 6357 2 11 48 40
0 6 764 39608 12 1237420 0 0 52880 94696 1891 7424 1 12 47 39
0 6 764 44264 12 1226064 0 0 42496 29723 1908 6035 1 9 31 58
0 6 764 26672 12 1214180 0 0 43520 117472 1944 6189 1 13 0 87
0 7 764 9132 12 1211732 0 0 22016 80400 1570 3304 1 8 0 92
1 8 764 9512 12 1200388 0 0 33288 62212 1687 4843 1 9 0 91
0 5 764 13980 12 1197096 0 0 5012 161 1619 2115 1 4 42 54
0 5 764 29604 12 1197220 0 0 0 112 1548 1602 0 3 50 48
0 5 764 49692 12 1197396 0 0 0 152 1484 1438 1 3 50 47
0 5 764 73128 12 1197644 0 0 0 120 1463 1392 1 3 49 47
0 4 764 99460 12 1197704 0 0 24 168 1545 1803 1 3 39 57
0 4 764 100088 12 1219296 0 0 11672 75450 1614 1371 0 5 73 22
0 6 764 50404 12 1269072 0 0 53632 145 1989 3871 1 9 34 56
0 6 764 51500 12 1267684 0 0 53632 608 1834 4437 1 8 21 71
4 5 764 51424 12 1266792 0 0 53504 7584 1847 4393 2 9 48 42
0 6 764 51456 12 1263736 0 0 53636 9804 1880 4326 1 10 9 81
0 6 764 50640 12 1263060 0 0 53504 4392 1929 4430 1 8 28 63
0 6 764 50956 12 1257884 24 0 50724 17214 1858 4755 1 11 35 54
0 6 764 48360 12 1247692 0 0 50840 48880 1871 6242 1 10 0 89
0 6 764 40028 12 1225860 0 0 42512 93346 1770 5599 2 11 0 87
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 6 764 20372 12 1211744 0 0 21512 123664 1747 4378 0 9 3 88
0 6 764 12140 12 1188584 0 0 20224 111244 1628 4244 1 9 15 76
0 7 764 11280 12 1171144 0 0 22300 80936 1669 5314 1 8 11 80
0 6 764 12168 12 1162840 0 0 28168 44072 1808 5065 1 8 0 92
1 7 3132 123740 12 1051848 0 2368 3852 34246 2097 2376 0 5 0 94
1 6 3132 19996 12 1155664 0 0 51752 290 1999 2136 2 8 0 91
The last lines of iostat (right before it kill -9'd my shell/ssh)
avg-cpu: %user %nice %system %iowait %steal %idle
0.51 0.00 7.65 91.84 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 87.63 3873.20 254.64 65.98 21905.15 24202.06 287.61 144.37 209.54 3.22 103.30
sdb 0.00 3873.20 1.03 132.99 12.37 60045.36 896.25 41.19 398.53 6.93 92.89
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 30.93 0.00 9.28 0.00 157.73 34.00 0.01 1.11 1.11 1.03
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 12.37 0.00 7.22 1.03 78.35 1.03 19.25 0.04 5.38 5.38 4.43
sdj 12.37 0.00 6.19 1.03 74.23 1.03 20.86 0.02 2.43 2.43 1.75
sdk 0.00 30.93 0.00 9.28 0.00 157.73 34.00 0.01 0.56 0.56 0.52
md0 0.00 0.00 0.00 610.31 0.00 2441.24 8.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 347.42 996.91 22181.44 7917.53 44.78 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 1.03 9.28 4.12 164.95 32.80 0.00 0.00 0.00 0.00
md4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.02 0.00 12.24 86.73 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 142.00 2497.00 403.00 38.00 34828.00 15112.00 226.49 142.77 181.95 2.27 100.10
sdb 0.00 2498.00 8.00 27.00 36.00 12844.00 736.00 3.63 45.31 6.06 21.20
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 7.00 25.00 1.00 11.00 32.00 144.00 29.33 0.04 3.58 2.58 3.10
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 10.00 7.00 4.00 1.00 56.00 32.00 35.20 0.01 2.40 2.40 1.20
sdj 10.00 22.00 4.00 8.00 56.00 120.00 29.33 0.03 2.50 2.50 3.00
sdk 7.00 10.00 1.00 4.00 32.00 56.00 35.20 0.01 2.40 2.40 1.20
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 558.00 7028.00 34664.00 56076.00 23.92 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 0.00 11.00 0.00 168.00 30.55 0.00 0.00 0.00 0.00
md4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The last lines of dmesg:
[ 5947.199985] lowmem_reserve[]: 0 0 0
[ 5947.199992] DMA: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3544kB
[ 5947.200010] Normal: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2740kB
[ 5947.200035] HighMem: 98*4kB 35*8kB 9*16kB 69*32kB 4*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3664kB
[ 5947.200052] Swap cache: add 789, delete 189, find 16/17, race 0+0
[ 5947.200055] Free swap = 2197628kB
[ 5947.200058] Total swap = 2200760kB
[ 5947.200060] Free swap: 2197628kB
[ 5947.205664] 517888 pages of RAM
[ 5947.205671] 288512 pages of HIGHMEM
[ 5947.205673] 5666 reserved pages
[ 5947.205675] 257163 pages shared
[ 5947.205678] 600 pages swap cached
[ 5947.205680] 88876 pages dirty
[ 5947.205682] 115111 pages writeback
[ 5947.205684] 5608 pages mapped
[ 5947.205686] 49367 pages slab
[ 5947.205688] 541 pages pagetables
[ 5947.205795] Out of memory: kill process 1853 (named) score 9937 or a child
[ 5947.205801] Killed process 1853 (named)
[ 5947.206616] bash invoked oom-killer: gfp_mask=0x84d0, order=0, oomkilladj=0
[ 5947.206621] [<c013e33b>] out_of_memory+0x17b/0x1b0
[ 5947.206631] [<c013fcac>] __alloc_pages+0x29c/0x2f0
[ 5947.206636] [<c01479ad>] __pte_alloc+0x1d/0x90
[ 5947.206643] [<c0148bf7>] copy_page_range+0x357/0x380
[ 5947.206649] [<c0119d75>] copy_process+0x765/0xfc0
[ 5947.206655] [<c012c3f9>] alloc_pid+0x1b9/0x280
[ 5947.206662] [<c011a839>] do_fork+0x79/0x1e0
[ 5947.206674] [<c015f91f>] do_pipe+0x5f/0xc0
[ 5947.206680] [<c0101176>] sys_clone+0x36/0x40
[ 5947.206686] [<c0103138>] syscall_call+0x7/0xb
[ 5947.206691] [<c0420033>] __sched_text_start+0x853/0x950
[ 5947.206698] =======================
On Sun 2007-01-21 14:27:34, Justin Piszcz wrote:
> Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> the OOM killer and kill all of my processes?
>
> Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> happens every time!
>
> Anything to try? Any other output needed? Can someone shed some light on
> this situation?
Is it highmem-related? Can you try it with mem=256M?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Mon, 22 Jan 2007, Pavel Machek wrote:
> On Sun 2007-01-21 14:27:34, Justin Piszcz wrote:
> > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> > the OOM killer and kill all of my processes?
> >
> > Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> > happens every time!
> >
> > Anything to try? Any other output needed? Can someone shed some light on
> > this situation?
>
> Is it highmem-related? Can you try it with mem=256M?
>
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
I will give this a try later or tomorrow; I cannot have my machine crash
at the moment.
Also, the onboard video on the Intel 965 chipset uses 128MB. Not sure if
that has anything to do with it, because after the system kill -9's all the
processes etc., my terminal looks like garbage.
Justin.
> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[email protected]> wrote:
> Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> the OOM killer and kill all of my processes?
What's that? Software raid or hardware raid? If the latter, which driver?
> Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> happens every time!
>
> Anything to try? Any other output needed? Can someone shed some light on
> this situation?
>
> Thanks.
>
>
> The last lines of vmstat 1 (right before it kill -9'd my shell/ssh)
>
> procs -----------memory---------- ---swap-- -----io---- -system--
> ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id
> wa
> 0 7 764 50348 12 1269988 0 0 53632 172 1902 4600 1 8
> 29 62
> 0 7 764 49420 12 1260004 0 0 53632 34368 1871 6357 2 11
> 48 40
The wordwrapping is painful :(
>
> The last lines of dmesg:
> [ 5947.199985] lowmem_reserve[]: 0 0 0
> [ 5947.199992] DMA: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB
> 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3544kB
> [ 5947.200010] Normal: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB
> 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2740kB
> [ 5947.200035] HighMem: 98*4kB 35*8kB 9*16kB 69*32kB 4*64kB 1*128kB
> 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3664kB
> [ 5947.200052] Swap cache: add 789, delete 189, find 16/17, race 0+0
> [ 5947.200055] Free swap = 2197628kB
> [ 5947.200058] Total swap = 2200760kB
> [ 5947.200060] Free swap: 2197628kB
> [ 5947.205664] 517888 pages of RAM
> [ 5947.205671] 288512 pages of HIGHMEM
> [ 5947.205673] 5666 reserved pages
> [ 5947.205675] 257163 pages shared
> [ 5947.205678] 600 pages swap cached
> [ 5947.205680] 88876 pages dirty
> [ 5947.205682] 115111 pages writeback
> [ 5947.205684] 5608 pages mapped
> [ 5947.205686] 49367 pages slab
> [ 5947.205688] 541 pages pagetables
> [ 5947.205795] Out of memory: kill process 1853 (named) score 9937 or a
> child
> [ 5947.205801] Killed process 1853 (named)
> [ 5947.206616] bash invoked oom-killer: gfp_mask=0x84d0, order=0,
> oomkilladj=0
> [ 5947.206621] [<c013e33b>] out_of_memory+0x17b/0x1b0
> [ 5947.206631] [<c013fcac>] __alloc_pages+0x29c/0x2f0
> [ 5947.206636] [<c01479ad>] __pte_alloc+0x1d/0x90
> [ 5947.206643] [<c0148bf7>] copy_page_range+0x357/0x380
> [ 5947.206649] [<c0119d75>] copy_process+0x765/0xfc0
> [ 5947.206655] [<c012c3f9>] alloc_pid+0x1b9/0x280
> [ 5947.206662] [<c011a839>] do_fork+0x79/0x1e0
> [ 5947.206674] [<c015f91f>] do_pipe+0x5f/0xc0
> [ 5947.206680] [<c0101176>] sys_clone+0x36/0x40
> [ 5947.206686] [<c0103138>] syscall_call+0x7/0xb
> [ 5947.206691] [<c0420033>] __sched_text_start+0x853/0x950
> [ 5947.206698] =======================
Important information from the oom-killing event is missing. Please send
it all.
From your earlier reports we have several hundred MB of ZONE_NORMAL memory
which has gone awol.
Please include /proc/meminfo from after the oom-killing.
Please work out what is using all that slab memory, via /proc/slabinfo.
After the oom-killing, please see if you can free up the ZONE_NORMAL memory
via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
work out what happened to the missing couple-of-hundred MB from
ZONE_NORMAL.
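A minimal sketch of one way to gather that (not from the thread; it assumes
the 2.6-era /proc layout, the slabinfo 2.x column order of name, active_objs,
num_objs, objsize, and a 32-bit highmem kernel for the LowTotal/LowFree lines):

    # snapshot state right after the OOM kill
    cat /proc/meminfo  > meminfo_after_oom.txt
    cat /proc/slabinfo > slabinfo_after_oom.txt

    # rough list of the biggest slab consumers (name, approx bytes = active_objs * objsize)
    awk 'NR > 2 { print $1, $2 * $4 }' /proc/slabinfo | sort -k2 -rn | head -20

    # try to reclaim pagecache and slab, then re-check lowmem (ZONE_NORMAL)
    echo 3 > /proc/sys/vm/drop_caches
    grep -E 'LowTotal|LowFree' /proc/meminfo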
> What's that? Software raid or hardware raid? If the latter, which driver?
Software RAID (md)
On Mon, 22 Jan 2007, Andrew Morton wrote:
> > On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[email protected]> wrote:
> > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> > the OOM killer and kill all of my processes?
>
> What's that? Software raid or hardware raid? If the latter, which driver?
>
> > Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> > happens every time!
> >
> > Anything to try? Any other output needed? Can someone shed some light on
> > this situation?
> >
> > Thanks.
> >
> >
> > The last lines of vmstat 1 (right before it kill -9'd my shell/ssh)
> >
> > procs -----------memory---------- ---swap-- -----io---- -system--
> > ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id
> > wa
> > 0 7 764 50348 12 1269988 0 0 53632 172 1902 4600 1 8
> > 29 62
> > 0 7 764 49420 12 1260004 0 0 53632 34368 1871 6357 2 11
> > 48 40
>
> The wordwrapping is painful :(
>
> >
> > The last lines of dmesg:
> > [...]
>
> Important information from the oom-killing event is missing. Please send
> it all.
>
> >From your earlier reports we have several hundred MB of ZONE_NORMAL memory
> which has gone awol.
>
> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.
>
> After the oom-killing, please see if you can free up the ZONE_NORMAL memory
> via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
> work out what happened to the missing couple-of-hundred MB from
> ZONE_NORMAL.
>
>
I believe this is the first part of it (hopefully):
2908kB active:86104kB inactive:1061904kB present:1145032kB pages_scanned:0 all_unreclaimable? no
[ 5947.199985] lowmem_reserve[]: 0 0 0
[ 5947.199992] DMA: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3544kB
[ 5947.200010] Normal: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2740kB
[ 5947.200035] HighMem: 98*4kB 35*8kB 9*16kB 69*32kB 4*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3664kB
[ 5947.200052] Swap cache: add 789, delete 189, find 16/17, race 0+0
[ 5947.200055] Free swap = 2197628kB
[ 5947.200058] Total swap = 2200760kB
[ 5947.200060] Free swap: 2197628kB
[ 5947.205664] 517888 pages of RAM
[ 5947.205671] 288512 pages of HIGHMEM
[ 5947.205673] 5666 reserved pages
[ 5947.205675] 257163 pages shared
[ 5947.205678] 600 pages swap cached
[ 5947.205680] 88876 pages dirty
[ 5947.205682] 115111 pages writeback
[ 5947.205684] 5608 pages mapped
[ 5947.205686] 49367 pages slab
[ 5947.205688] 541 pages pagetables
[ 5947.205795] Out of memory: kill process 1853 (named) score 9937 or a child
[ 5947.205801] Killed process 1853 (named)
[ 5947.206616] bash invoked oom-killer: gfp_mask=0x84d0, order=0, oomkilladj=0
[ 5947.206621] [<c013e33b>] out_of_memory+0x17b/0x1b0
[ 5947.206631] [<c013fcac>] __alloc_pages+0x29c/0x2f0
[ 5947.206636] [<c01479ad>] __pte_alloc+0x1d/0x90
[ 5947.206643] [<c0148bf7>] copy_page_range+0x357/0x380
[ 5947.206649] [<c0119d75>] copy_process+0x765/0xfc0
[ 5947.206655] [<c012c3f9>] alloc_pid+0x1b9/0x280
[ 5947.206662] [<c011a839>] do_fork+0x79/0x1e0
[ 5947.206674] [<c015f91f>] do_pipe+0x5f/0xc0
[ 5947.206680] [<c0101176>] sys_clone+0x36/0x40
[ 5947.206686] [<c0103138>] syscall_call+0x7/0xb
[ 5947.206691] [<c0420033>] __sched_text_start+0x853/0x950
[ 5947.206698] =======================
I will have to include the other parts when I am near the machine and can
reboot it locally :)
Justin.
On Mon 2007-01-22 13:48:44, Justin Piszcz wrote:
>
>
> On Mon, 22 Jan 2007, Pavel Machek wrote:
>
> > On Sun 2007-01-21 14:27:34, Justin Piszcz wrote:
> > > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> > > the OOM killer and kill all of my processes?
> > >
> > > Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> > > happens every time!
> > >
> > > Anything to try? Any other output needed? Can someone shed some light on
> > > this situation?
> >
> > Is it highmem-related? Can you try it with mem=256M?
> >
> > Pavel
> > --
> > (english) http://www.livejournal.com/~pavelmachek
> > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>
> I will give this a try later or tomorrow, I cannot have my machine crash
> at the moment.
>
> Also, the onboard video on the Intel 965 chipset uses 128MB, not sure if
> that has anything to do with it because after the system kill -9's all the
> processes etc, my terminal looks like garbage.
That looks like a separate problem. Switch to a text-mode console (vgacon,
not fbcon) for tests.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Andrew Morton wrote:
>> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[email protected]> wrote:
>> Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
>> the OOM killer and kill all of my processes?
>
> What's that? Software raid or hardware raid? If the latter, which driver?
I've hit this using local disk while testing xfs built against 2.6.20-rc4 (SMP x86_64).
dmesg follows; I'm not sure if anything in this is useful after the first event, as our automated tests continued on
after the failure.
> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.
Sorry I didn't pick this up either.
I'll try to reproduce this and gather some more detailed info for a single event.
Donald
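One possible way to capture a single event in more detail, sketched here with
arbitrary file names and a one-second interval:

    # log memory and slab state once a second until the box OOMs
    while true; do
        date                     >> oom-trace.log
        cat /proc/meminfo        >> oom-trace.log
        head -40 /proc/slabinfo  >> oom-trace.log
        sleep 1
    done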
...
XFS mounting filesystem sdb5
Ending clean XFS mount for filesystem: sdb5
XFS mounting filesystem sdb5
Ending clean XFS mount for filesystem: sdb5
hald invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8026d889>] alloc_page_vma+0x71/0x76
[<ffffffff8026937b>] read_swap_cache_async+0x45/0xd8
[<ffffffff8025f2e0>] swapin_readahead+0x60/0xd3
[<ffffffff80260ece>] __handle_mm_fault+0x703/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff80278adf>] do_readv_writev+0x176/0x18b
[<ffffffff8052efde>] thread_return+0x0/0xed
[<ffffffff8034d7f5>] __const_udelay+0x2c/0x2d
[<ffffffff803f4e0b>] scsi_done+0x0/0x17
[<ffffffff8053109d>] error_exit+0x0/0x84
Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53
CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU 2: Hot: hi: 186, btch: 31 usd: 20 Cold: hi: 62, btch: 15 usd: 47
CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 56
Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3296 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB inactive:1981624kB present:2052068kB pages_scanned:4343329 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 273*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 6684kB
Swap cache: add 741048, delete 244661, find 84826/143198, race 680+239
Free swap = 1088524kB
Total swap = 3140668kB
Free swap: 1088524kB
524224 pages of RAM
9619 reserved pages
259 pages shared
496388 pages swap cached
No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child
Killed process 3626 (hald-addon-acpi)
top invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8026e6a3>] alloc_pages_current+0x74/0x79
[<ffffffff802548c8>] __page_cache_alloc+0xb/0xe
[<ffffffff8025a65f>] __do_page_cache_readahead+0xa1/0x217
[<ffffffff8052f776>] io_schedule+0x28/0x33
[<ffffffff8052f9e7>] __wait_on_bit_lock+0x5b/0x66
[<ffffffff802546de>] __lock_page+0x72/0x78
[<ffffffff8025ab22>] do_page_cache_readahead+0x4e/0x5a
[<ffffffff80256714>] filemap_nopage+0x140/0x30c
[<ffffffff802609c6>] __handle_mm_fault+0x1fb/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff80228273>] __wake_up+0x43/0x50
[<ffffffff80380bd5>] tty_ldisc_deref+0x71/0x76
[<ffffffff8053109d>] error_exit+0x0/0x84
Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53
CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU 2: Hot: hi: 186, btch: 31 usd: 1 Cold: hi: 62, btch: 15 usd: 10
CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 26
Active:90 inactive:496233 dirty:0 writeback:0 unstable:0 free:3485 slab:9119 mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3328 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:5904kB min:5712kB low:7140kB high:8568kB active:360kB inactive:1983092kB present:2052068kB pages_scanned:4587649 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 78*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 5904kB
Swap cache: add 741067, delete 244673, find 84826/143210, race 680+239
Free swap = 1088572kB
Total swap = 3140668kB
Free swap: 1088572kB
524224 pages of RAM
9619 reserved pages
290 pages shared
496396 pages swap cached
No available memory (MPOL_BIND): kill process 7914 (top) score 0 or a child
Killed process 7914 (top)
nscd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8026d889>] alloc_page_vma+0x71/0x76
[<ffffffff8026937b>] read_swap_cache_async+0x45/0xd8
[<ffffffff80260ede>] __handle_mm_fault+0x713/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff80238e16>] try_to_del_timer_sync+0x51/0x5a
[<ffffffff80238e2b>] del_timer_sync+0xc/0x16
[<ffffffff8052f939>] schedule_timeout+0x92/0xad
[<ffffffff80238a40>] process_timeout+0x0/0xb
[<ffffffff8029e563>] sys_epoll_wait+0x3e0/0x421
[<ffffffff8022a96b>] default_wake_function+0x0/0xf
[<ffffffff8053109d>] error_exit+0x0/0x84
Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 53
CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU 2: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 14
CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 26
Active:91 inactive:496325 dirty:0 writeback:0 unstable:0 free:3425 slab:9119 mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3328 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:5664kB min:5712kB low:7140kB high:8568kB active:364kB inactive:1983372kB present:2052068kB pages_scanned:4610273 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 18*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 5664kB
Swap cache: add 741069, delete 244674, find 84826/143212, race 680+239
Free swap = 1088576kB
Total swap = 3140668kB
Free swap: 1088576kB
524224 pages of RAM
9619 reserved pages
293 pages shared
496396 pages swap cached
No available memory (MPOL_BIND): kill process 4166 (nscd) score 0 or a child
Killed process 4166 (nscd)
xfs_repair invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8026e6a3>] alloc_pages_current+0x74/0x79
[<ffffffff802548c8>] __page_cache_alloc+0xb/0xe
[<ffffffff8025a65f>] __do_page_cache_readahead+0xa1/0x217
[<ffffffff8025ab22>] do_page_cache_readahead+0x4e/0x5a
[<ffffffff80256714>] filemap_nopage+0x140/0x30c
[<ffffffff802609c6>] __handle_mm_fault+0x1fb/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff802426eb>] autoremove_wake_function+0x0/0x2e
[<ffffffff80244ea4>] up_write+0x9/0xb
[<ffffffff8026602d>] sys_mprotect+0x645/0x764
[<ffffffff8053109d>] error_exit+0x0/0x84
Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 53
CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU 2: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 14
CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 26
Active:91 inactive:496247 dirty:0 writeback:0 unstable:0 free:3394 slab:9119 mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3328 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:5540kB min:5712kB low:7140kB high:8568kB active:364kB inactive:1983300kB present:2052068kB pages_scanned:4631841 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 1*4kB 22*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 5540kB
Swap cache: add 741070, delete 244674, find 84826/143212, race 680+239
Free swap = 1088576kB
Total swap = 3140668kB
Free swap: 1088576kB
524224 pages of RAM
9619 reserved pages
293 pages shared
496397 pages swap cached
No available memory (MPOL_BIND): kill process 17869 (xfs_repair) score 0 or a child
Killed process 17869 (xfs_repair)
klogd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8025afc2>] __pagevec_lru_add_active+0xce/0xde
[<ffffffff8026d889>] alloc_page_vma+0x71/0x76
[<ffffffff8026937b>] read_swap_cache_async+0x45/0xd8
[<ffffffff80260ede>] __handle_mm_fault+0x713/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff802426eb>] autoremove_wake_function+0x0/0x2e
[<ffffffff8053109d>] error_exit+0x0/0x84
...
On Tue, 23 Jan 2007 11:37:09 +1100
Donald Douwsma <[email protected]> wrote:
> Andrew Morton wrote:
> >> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[email protected]> wrote:
> >> Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> >> the OOM killer and kill all of my processes?
> >
> > What's that? Software raid or hardware raid? If the latter, which driver?
>
> I've hit this using local disk while testing xfs built against 2.6.20-rc4 (SMP x86_64)
>
> dmesg follows, I'm not sure if anything in this is useful after the first event as our automated tests continued on
> after the failure.
This looks different.
> ...
>
> Mem-info:
> Node 0 DMA per-cpu:
> CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53
> CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
> CPU 2: Hot: hi: 186, btch: 31 usd: 20 Cold: hi: 62, btch: 15 usd: 47
> CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 56
> Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 mapped:32 pagetables:637
No dirty pages, no pages under writeback.
> Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3296
> all_unreclaimable? yes
> lowmem_reserve[]: 0 2003 2003
> Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB inactive:1981624kB present:2052068kB
Inactive list is filled.
> pages_scanned:4343329 all_unreclaimable? yes
We scanned our guts out and decided that nothing was reclaimable.
> No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child
> No available memory (MPOL_BIND): kill process 7914 (top) score 0 or a child
> No available memory (MPOL_BIND): kill process 4166 (nscd) score 0 or a child
> No available memory (MPOL_BIND): kill process 17869 (xfs_repair) score 0 or a child
But in all cases a constrained memory policy was in use.
> Is it highmem-related? Can you try it with mem=256M?
Bad idea: the kernel crashes & burns when I use mem=256, and I had to boot
2.6.20-rc5-6 single to get back into my machine, very nasty. Remember I
use an onboard graphics controller that has 128MB of RAM allocated to it,
and I believe the ICH8 chipset also uses some memory. In any event mem=256
causes the machine to lock up before it can even get to the boot/init
processes; the two LEDs on the keyboard were blinking, caps lock and
scroll lock, and I saw no console at all!
Justin.
On Mon, 22 Jan 2007, Justin Piszcz wrote:
>
>
> On Mon, 22 Jan 2007, Pavel Machek wrote:
>
> > On Sun 2007-01-21 14:27:34, Justin Piszcz wrote:
> > > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> > > the OOM killer and kill all of my processes?
> > >
> > > Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> > > happens every time!
> > >
> > > Anything to try? Any other output needed? Can someone shed some light on
> > > this situation?
> >
> > Is it highmem-related? Can you try it with mem=256M?
> >
> > Pavel
> > --
> > (english) http://www.livejournal.com/~pavelmachek
> > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>
> I will give this a try later or tomorrow, I cannot have my machine crash
> at the moment.
>
> Also, the onboard video on the Intel 965 chipset uses 128MB, not sure if
> that has anything to do with it because after the system kill -9's all the
> processes etc, my terminal looks like garbage.
>
> Justin.
>
>
On Mon, 22 Jan 2007, Andrew Morton wrote:
> > On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[email protected]> wrote:
> > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> > the OOM killer and kill all of my processes?
>
> What's that? Software raid or hardware raid? If the latter, which driver?
>
> > Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> > happens every time!
> >
> > Anything to try? Any other output needed? Can someone shed some light on
> > this situation?
> >
> > Thanks.
> >
> >
> > The last lines of vmstat 1 (right before it kill -9'd my shell/ssh)
> >
> > procs -----------memory---------- ---swap-- -----io---- -system--
> > ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id
> > wa
> > 0 7 764 50348 12 1269988 0 0 53632 172 1902 4600 1 8
> > 29 62
> > 0 7 764 49420 12 1260004 0 0 53632 34368 1871 6357 2 11
> > 48 40
>
> The wordwrapping is painful :(
>
> >
> > The last lines of dmesg:
> > [...]
>
> Important information from the oom-killing event is missing. Please send
> it all.
>
> >From your earlier reports we have several hundred MB of ZONE_NORMAL memory
> which has gone awol.
>
> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.
>
> After the oom-killing, please see if you can free up the ZONE_NORMAL memory
> via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
> work out what happened to the missing couple-of-hundred MB from
> ZONE_NORMAL.
>
>
Trying this now.
And FYI yes I used mem=256M just as you said, not mem=256.
Justin.
On Wed, 24 Jan 2007, Justin Piszcz wrote:
> > Is it highmem-related? Can you try it with mem=256M?
>
> Bad idea, the kernel crashes & burns when I use mem=256, I had to boot
> 2.6.20-rc5-6 single to get back into my machine, very nasty. Remember I
> use an onboard graphics controller that has 128MB of RAM allocated to it
> and I believe the ICH8 chipset also uses some memory, in any event mem=256
> causes the machine to lockup before it can even get to the boot/init
> processes, the two leds on the keyboard were blinking, caps lock and
> scroll lock and I saw no console at all!
>
> Justin.
>
> On Mon, 22 Jan 2007, Justin Piszcz wrote:
>
> >
> >
> > On Mon, 22 Jan 2007, Pavel Machek wrote:
> >
> > > On Sun 2007-01-21 14:27:34, Justin Piszcz wrote:
> > > > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> > > > the OOM killer and kill all of my processes?
> > > >
> > > > Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> > > > happens every time!
> > > >
> > > > Anything to try? Any other output needed? Can someone shed some light on
> > > > this situation?
> > >
> > > Is it highmem-related? Can you try it with mem=256M?
> > >
> > > Pavel
> > > --
> > > (english) http://www.livejournal.com/~pavelmachek
> > > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> >
> > I will give this a try later or tomorrow, I cannot have my machine crash
> > at the moment.
> >
> > Also, the onboard video on the Intel 965 chipset uses 128MB, not sure if
> > that has anything to do with it because after the system kill -9's all the
> > processes etc, my terminal looks like garbage.
> >
> > Justin.
> >
> >
>
>
On Mon, 22 Jan 2007, Andrew Morton wrote:
> > On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[email protected]> wrote:
> > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> > the OOM killer and kill all of my processes?
>
> What's that? Software raid or hardware raid? If the latter, which driver?
>
> > Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> > happens every time!
> >
> > Anything to try? Any other output needed? Can someone shed some light on
> > this situation?
> >
> > Thanks.
> >
> >
> > The last lines of vmstat 1 (right before it kill -9'd my shell/ssh)
> >
> > procs -----------memory---------- ---swap-- -----io---- -system--
> > ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id
> > wa
> > 0 7 764 50348 12 1269988 0 0 53632 172 1902 4600 1 8
> > 29 62
> > 0 7 764 49420 12 1260004 0 0 53632 34368 1871 6357 2 11
> > 48 40
>
> The wordwrapping is painful :(
>
> >
> > The last lines of dmesg:
> > [...]
>
> Important information from the oom-killing event is missing. Please send
> it all.
>
> >From your earlier reports we have several hundred MB of ZONE_NORMAL memory
> which has gone awol.
>
> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.
>
> After the oom-killing, please see if you can free up the ZONE_NORMAL memory
> via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
> work out what happened to the missing couple-of-hundred MB from
> ZONE_NORMAL.
>
>
Running with PREEMPT OFF lets me copy the file!! The machine LAGS
occasionally, every 5-30-60 seconds or so, VERY BADLY (we're talking 5-10
seconds of lag), but hey, it does not crash!! I will boot the older kernel
with preempt on and see if I can get you that information you requested.
Justin.
Hi!
> > Is it highmem-related? Can you try it with mem=256M?
>
> Bad idea, the kernel crashes & burns when I use mem=256, I had to boot
> 2.6.20-rc5-6 single to get back into my machine, very nasty. Remember I
> use an onboard graphics controller that has 128MB of RAM allocated to it
> and I believe the ICH8 chipset also uses some memory, in any event mem=256
> causes the machine to lockup before it can even get to the boot/init
> processes, the two leds on the keyboard were blinking, caps lock and
> scroll lock and I saw no console at all!
Okay, so try mem=700M or disable CONFIG_HIGHMEM or something.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
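For reference, a sketch of the two options Pavel mentions; the lilo.conf
entry and kernel image name below are illustrative, not anyone's actual config:

    # /etc/lilo.conf: cap the RAM the kernel sees (rerun lilo, then reboot)
    image=/boot/vmlinuz-2.6.20-rc5
        label=linux-700m
        append="mem=700M"

    # or rebuild the kernel without highmem:
    #   Processor type and features -> High Memory Support -> off (CONFIG_NOHIGHMEM=y)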
There is some XFS stuff in the dmesg too, which is why I am continuing to
include the XFS mailing list. Scroll down to read more.
On Mon, 22 Jan 2007, Andrew Morton wrote:
> > On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[email protected]> wrote:
> > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
> > the OOM killer and kill all of my processes?
>
> What's that? Software raid or hardware raid? If the latter, which driver?
>
> > Doing this on a single disk 2.6.19.2 is OK, no issues. However, this
> > happens every time!
> >
> > Anything to try? Any other output needed? Can someone shed some light on
> > this situation?
> >
> > Thanks.
> >
> >
> > The last lines of vmstat 1 (right before it kill -9'd my shell/ssh)
> >
> > procs -----------memory---------- ---swap-- -----io---- -system--
> > ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id
> > wa
> > 0 7 764 50348 12 1269988 0 0 53632 172 1902 4600 1 8
> > 29 62
> > 0 7 764 49420 12 1260004 0 0 53632 34368 1871 6357 2 11
> > 48 40
>
> The wordwrapping is painful :(
>
> >
> > The last lines of dmesg:
> > [...]
>
> Important information from the oom-killing event is missing. Please send
> it all.
>
> >From your earlier reports we have several hundred MB of ZONE_NORMAL memory
> which has gone awol.
>
> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.
>
> After the oom-killing, please see if you can free up the ZONE_NORMAL memory
> via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
> work out what happened to the missing couple-of-hundred MB from
> ZONE_NORMAL.
>
>
I have done all you said. I ran a constant loop w/ vmstat & cat
/proc/slabinfo; toward the end of the file(s) (_after_oom_killer) is when I
ran:
p34:~# echo 3 > /proc/sys/vm/drop_caches
p34:~# echo 3 > /proc/sys/vm/drop_caches
p34:~# echo 3 > /proc/sys/vm/drop_caches
p34:~#
The tarball will yield the following files you requested:
4.0K _proc_meminfo_after_oom_killing.txt
976K _slabinfo_after_oom_killer.txt
460K _slabinfo.txt
8.0K _vmstat_after_oom_killer.txt
4.0K _vmstat.txt
I am going back to 2.6.20-rc5-6 w/ NO-PRE-EMPT, as this was about 10x more
stable in copying operations and everything else. If you need any more
diagnostics/crashing(s) in this manner, let me know, because I can make it
happen every single time with pre-empt on.
And not sure if it matters, but when I copy 18gb to 18g.2 I have seen it
make it to various stages. Size the 18g.2 copy got to, per attempt:
1. 7.8G
2. 4.3G
3. 1.2G
Back to 2.6.20-rc5 w/ no pre-emption; until then, or if you request
anything else, let me know. Also I ran echo t > /proc/sysrq-trigger.
I have attached THE ENTIRE syslog/kernel dmesg/ring buffer so you'll see
my system booting up with no problems/errors up until now, at which point I
am going back to the old kernel. This is attached as
kernel_ring_buffer.txt.bz2
Andrew,
Please let me know if any of this helps!
Justin.
On Thu, 25 Jan 2007, Pavel Machek wrote:
> Hi!
>
> > > Is it highmem-related? Can you try it with mem=256M?
> >
> > Bad idea, the kernel crashes & burns when I use mem=256, I had to boot
> > 2.6.20-rc5-6 single to get back into my machine, very nasty. Remember I
> > use an onboard graphics controller that has 128MB of RAM allocated to it
> > and I believe the ICH8 chipset also uses some memory, in any event mem=256
> > causes the machine to lockup before it can even get to the boot/init
> > processes, the two leds on the keyboard were blinking, caps lock and
> > scroll lock and I saw no console at all!
>
> Okay, so try mem=700M or disable CONFIG_HIGHMEM or something.
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Ok, this will be my last test for tonight, trying now.
Justin.
Justin Piszcz wrote:
>
> On Mon, 22 Jan 2007, Andrew Morton wrote:
>>After the oom-killing, please see if you can free up the ZONE_NORMAL memory
>>via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
>>work out what happened to the missing couple-of-hundred MB from
>>ZONE_NORMAL.
>>
>
> Running with PREEMPT OFF lets me copy the file!! The machine LAGS
> occasionally every 5-30-60 seconds or so VERY BADLY, talking 5-10 seconds
> of lag, but hey, it does not crash!! I will boot the older kernel with
> preempt on and see if I can get you that information you requested.
It wouldn't be a bad idea to recompile the new kernel with preempt on
and get the info from there.
It is usually best to be working with the most recent kernels. We can
always backport any important fixes if we need to.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Thu, 25 Jan 2007, Pavel Machek wrote:
> Hi!
>
> > > Is it highmem-related? Can you try it with mem=256M?
> >
> > Bad idea, the kernel crashes & burns when I use mem=256, I had to boot
> > 2.6.20-rc5-6 single to get back into my machine, very nasty. Remember I
> > use an onboard graphics controller that has 128MB of RAM allocated to it
> > and I believe the ICH8 chipset also uses some memory, in any event mem=256
> > causes the machine to lockup before it can even get to the boot/init
> > processes, the two leds on the keyboard were blinking, caps lock and
> > scroll lock and I saw no console at all!
>
> Okay, so try mem=700M or disable CONFIG_HIGHMEM or something.
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Looks like you may be onto something, Pavel: it has not invoked the OOM
killer yet using mem=700M. I see the swap increasing, and the speed of the
copy is only 14-15MB/s; when mem= is off (giving me all memory w/ preempt)
I get 45-65MB/s.
With append="mem=700M", seen below:
top - 19:38:46 up 1 min, 3 users, load average: 1.24, 0.38, 0.13
Tasks: 172 total, 1 running, 171 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.6%us, 4.6%sy, 1.3%ni, 69.6%id, 13.9%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 705512k total, 699268k used, 6244k free, 12k buffers
Swap: 2200760k total, 18520k used, 2182240k free, 34968k cached
(with mem=700M):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 40276 7456 12 23184 0 0 13196 14904 1204 4756 1 2 50 47
0 2 40276 6156 12 24264 0 0 20096 18352 1293 6400 1 4 49 46
0 2 40276 6264 12 24264 0 0 15504 17836 1219 5202 0 2 50 48
0 2 40276 6144 12 24376 0 0 14348 14453 1190 4815 0 3 49 48
0 2 40276 6156 12 24504 0 0 11396 12724 1169 3724 1 2 50 48
0 1 40276 7532 12 23272 0 0 11412 13121 1183 4017 0 2 50 48
0 1 40276 7944 12 22644 0 0 19084 19144 1234 6548 0 4 50 46
Almost there, looks like it's going to make it.
-rw-r--r-- 1 user group 18630127104 2007-01-21 12:41 18gb
-rw-r--r-- 1 user group 15399329792 2007-01-24 19:55 18gb.copy
Hrmm, with preempt off it works and with preempt ON and mem=700M it works.
I guess I will run with preempt off as I'd prefer to have the memory
available-- and tolerate the large lag bursts w/ no preemption.
Yup it worked.
-rw-r--r-- 1 user group 18630127104 2007-01-21 12:41 18gb
-rw-r--r-- 1 user group 18630127104 2007-01-21 12:41 18gb.copy
Justin.
Justin Piszcz wrote:
> On Mon, 22 Jan 2007, Andrew Morton wrote:
>
>>> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[email protected]> wrote:
>>> Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke
>>> the OOM killer and kill all of my processes?
>>>
> Running with PREEMPT OFF lets me copy the file!! The machine LAGS
> occasionally every 5-30-60 seconds or so VERY BADLY, talking 5-10 seconds
> of lag, but hey, it does not crash!! I will boot the older kernel with
> preempt on and see if I can get you that information you requested.
>
Justin,
According to your kernel_ring_buffer.txt (attached to another email),
you are using "anticipatory" as your io scheduler:
289 Jan 24 18:35:25 p34 kernel: [ 0.142130] io scheduler noop registered
290 Jan 24 18:35:25 p34 kernel: [ 0.142194] io scheduler anticipatory registered (default)
I had a problem with this scheduler where my system would occasionally
lock up during heavy I/O. Sometimes it would fix itself, sometimes I had
to reboot. I changed to the "CFQ" io scheduler and my system has worked
fine since then.
CFQ has to be built into the kernel (under Block Layer / IO Schedulers). It
can be selected as the default, or you can set it at runtime:
echo cfq > /sys/block/<disk>/queue/scheduler
...
Hope this helps,
Bill
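A small sketch expanding on the runtime switch above; the device name sda
and the boot-parameter line are illustrative:

    # list available schedulers; the active one is shown in brackets
    cat /sys/block/sda/queue/scheduler

    # switch one disk to CFQ on the running system
    echo cfq > /sys/block/sda/queue/scheduler

    # or make CFQ the default for all disks at boot (kernel command line)
    #   append="elevator=cfq"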
On Thu, 25 Jan 2007, Pavel Machek wrote:
> Hi!
>
> > > Is it highmem-related? Can you try it with mem=256M?
> >
> > Bad idea, the kernel crashes & burns when I use mem=256, I had to boot
> > 2.6.20-rc5-6 single to get back into my machine, very nasty. Remember I
> > use an onboard graphics controller that has 128MB of RAM allocated to it
> > and I believe the ICH8 chipset also uses some memory, in any event mem=256
> > causes the machine to lockup before it can even get to the boot/init
> > processes, the two leds on the keyboard were blinking, caps lock and
> > scroll lock and I saw no console at all!
>
> Okay, so try mem=700M or disable CONFIG_HIGHMEM or something.
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
I forgot to remove the mem=700M with PREEMPT off and this is what I saw
this morning:
[18005.261875] Killed process 4768 (screen)
[18005.262343] Out of memory: kill process 4793 (screen) score 385 or a child
[18005.262350] Killed process 4793 (screen)
[18005.378536] Out of memory: kill process 4825 (screen) score 385 or a child
[18005.378542] Killed process 4825 (screen)
[18005.378547] Out of memory: kill process 4825 (screen) score 385 or a child
[18005.378553] Killed process 4825 (screen)
[18005.413072] Out of memory: kill process 4875 (screen) score 385 or a child
[18005.413079] Killed process 4875 (screen)
[18005.423735] Out of memory: kill process 4970 (screen) score 385 or a child
[18005.423742] Killed process 4970 (screen)
[18005.431391] Out of memory: kill process 21365 (xfs_fsr) score 286 or a child
[18005.431398] Killed process 21365 (xfs_fsr)
$ screen -ls
There are screens on:
2532.pts-0.p34 (Dead ???)
3776.pts-2.p34 (Dead ???)
4768.pts-7.p34 (Dead ???)
4793.pts-9.p34 (Dead ???)
4825.pts-11.p34 (Dead ???)
4875.pts-13.p34 (Dead ???)
4970.pts-15.p34 (Dead ???)
Lovely...
$ uname -r
2.6.20-rc5
Something is seriously wrong with that OOM killer.
Justin.
On Thu, 25 Jan 2007, Nick Piggin wrote:
> Justin Piszcz wrote:
> >
> > On Mon, 22 Jan 2007, Andrew Morton wrote:
>
> > >After the oom-killing, please see if you can free up the ZONE_NORMAL memory
> > >via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
> > >work out what happened to the missing couple-of-hundred MB from
> > >ZONE_NORMAL.
> > >
> >
> > Running with PREEMPT OFF lets me copy the file!! The machine LAGS
> > occasionally every 5-30-60 seconds or so VERY BADLY, talking 5-10 seconds of
> > lag, but hey, it does not crash!! I will boot the older kernel with preempt
> > on and see if I can get you that information you requested.
>
> It wouldn't be a bad idea to recompile the new kernel with preempt on
> and get the info from there.
>
> It is usually best to be working with the most recent kernels. We can
> always backport any important fixes if we need to.
>
> Thanks,
> Nick
>
> --
> SUSE Labs, Novell Inc.
In my tests for the most part I am using the latest kernels.
Justin.
On Wed, 24 Jan 2007, Bill Cizek wrote:
> Justin Piszcz wrote:
> > On Mon, 22 Jan 2007, Andrew Morton wrote:
> >
> > > > On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz
> > > > <[email protected]> wrote:
> > > > Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to
> > > > invoke the OOM killer and kill all of my processes?
> > > >
> > Running with PREEMPT OFF lets me copy the file!! The machine LAGS
> > occasionally every 5-30-60 seconds or so VERY BADLY, talking 5-10 seconds of
> > lag, but hey, it does not crash!! I will boot the older kernel with preempt
> > on and see if I can get you that information you requested.
> >
> Justin,
>
> According to your kernel_ring_buffer.txt (attached to another email), you are
> using "anticipatory" as your io scheduler:
> 289 Jan 24 18:35:25 p34 kernel: [ 0.142130] io scheduler noop registered
> 290 Jan 24 18:35:25 p34 kernel: [ 0.142194] io scheduler anticipatory
> registered (default)
>
> I had a problem with this scheduler where my system would occasionally lockup
> during heavy I/O. Sometimes it would fix itself, sometimes I had to reboot.
> I changed to the "CFQ" io scheduler and my system has worked fine since then.
>
> CFQ has to be built into the kernel (under BlockLayer/IOSchedulers). It can
> be selected as default or you can set it during runtime:
>
> echo cfq > /sys/block/<disk>/queue/scheduler
> ...
>
> Hope this helps,
> Bill
>
>
I used to run CFQ a while back, but then I switched over to AS as it has
better performance for my workloads. Currently I am running with PREEMPT
off; if I see any additional issues, I will switch to the CFQ scheduler.
Right now, it's the OOM killer that is going crazy.
Justin.
> Something is seriously wrong with that OOM killer.
do you know you don't have to operate in OOM-slaughter mode?
"vm.overcommit_memory = 2" in your /etc/sysctl.conf puts you
into a mode where the kernel tracks your "committed" memory
needs, and will eventually cause some allocations to fail.
this is often much nicer than the default random OOM slaughter.
(you probably also need to adjust vm.overcommit_ratio with
some knowledge of your MemTotal and SwapTotal.)
regards, mark hahn.
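A sketch of what Mark describes, with illustrative values (the ratio and
sizes below are examples, not this machine's). With overcommit_memory=2 the
commit limit works out to roughly SwapTotal + MemTotal * overcommit_ratio / 100:

    # /etc/sysctl.conf
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 50

    # apply without rebooting, then check the resulting limit
    sysctl -p
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo

    # e.g. 2GB RAM + 2GB swap at ratio 50 gives CommitLimit of about 2GB + 2GB*0.50 = 3GB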
On Thu, 25 Jan 2007, Mark Hahn wrote:
> > Something is seriously wrong with that OOM killer.
>
> do you know you don't have to operate in OOM-slaughter mode?
>
> "vm.overcommit_memory = 2" in your /etc/sysctl.conf puts you into a mode where
> the kernel tracks your "committed" memory needs, and will eventually cause
> some allocations to fail.
> this is often much nicer than the default random OOM slaughter.
> (you probably also need to adjust vm.overcommit_ratio with some knowlege of
> your MemTotal and SwapTotal.)
>
> regards, mark hahn.
# sysctl -a | grep vm.over
vm.overcommit_ratio = 50
vm.overcommit_memory = 0
I'll have to experiment with these options, thanks for the info!
Justin.