2007-06-13 08:49:51

by John Sigler

[permalink] [raw]
Subject: Runaway process and oom-killer

Linux version 2.6.20.7-rt8 (root@wagram) (gcc version 3.4.3) #1 PREEMPT Fri Apr 20 12:44:16 CEST 2007
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: 0000000000000000 size: 00000000000a0000 end: 00000000000a0000 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 00000000000f0000 size: 0000000000010000 end: 0000000000100000 type: 2
copy_e820_map() start: 0000000000100000 size: 000000000fef0000 end: 000000000fff0000 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000000fff0000 size: 0000000000003000 end: 000000000fff3000 type: 4
copy_e820_map() start: 000000000fff3000 size: 000000000000d000 end: 0000000010000000 type: 3
copy_e820_map() start: 00000000ffff0000 size: 0000000000010000 end: 0000000100000000 type: 2
BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000000fff0000 (usable)
BIOS-e820: 000000000fff0000 - 000000000fff3000 (ACPI NVS)
BIOS-e820: 000000000fff3000 - 0000000010000000 (ACPI data)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
255MB LOWMEM available.
Entering add_active_range(0, 0, 65520) 0 entries of 256 used
Zone PFN ranges:
DMA 0 -> 4096
Normal 4096 -> 65520
early_node_map[1] active PFN ranges
0: 0 -> 65520
On node 0 totalpages: 65520
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 4064 pages, LIFO batch:0
Normal zone: 479 pages used for memmap
Normal zone: 60945 pages, LIFO batch:15
DMI 2.3 present.
ACPI: RSDP (v000 VIA601 ) @ 0x000f6950
ACPI: RSDT (v001 VIA601 AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0fff3000
ACPI: FADT (v001 VIA601 AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0fff3040
ACPI: DSDT (v001 VIA601 AWRDACPI 0x00001000 MSFT 0x0100000c) @ 0x00000000
ACPI: PM-Timer IO Port: 0x4008
Allocating PCI resources starting at 20000000 (gap: 10000000:efff0000)
Detected 1266.700 MHz processor.
Real-Time Preemption Support (C) 2004-2007 Ingo Molnar
Built 1 zonelists. Total pages: 65009
Kernel command line: ro root=/dev/hda2 console=ttyS0,57600n8 console=tty0 panic=3 lpj=12667000
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
WARNING: experimental RCU implementation.
PID hash table entries: 1024 (order: 10, 4096 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 256680k/262080k available (1735k kernel code, 4916k reserved, 584k data, 168k init, 0k highmem)
virtual kernel memory layout:
fixmap : 0xffff8000 - 0xfffff000 ( 28 kB)
vmalloc : 0xd0800000 - 0xffff6000 ( 759 MB)
lowmem : 0xc0000000 - 0xcfff0000 ( 255 MB)
.init : 0xc0348000 - 0xc0372000 ( 168 kB)
.data : 0xc02b1fe8 - 0xc0344238 ( 584 kB)
.text : 0xc0100000 - 0xc02b1fe8 (1735 kB)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay loop (skipped)... 2533.40 BogoMIPS preset
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383f9ff 00000000 00000000 00000000 00000000 00000000 00000000
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
CPU: After all inits, caps: 0383f9ff 00000000 00000000 00000040 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel(R) Pentium(R) III CPU - S 1266MHz stepping 04
Checking 'hlt' instruction... OK.
ACPI: Core revision 20060707
ACPI: setting ELCR to 0080 (from 1ca0)
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfb210, last bus=2
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
0000:00:07.1: cannot adjust BAR0 (not I/O)
0000:00:07.1: cannot adjust BAR1 (not I/O)
0000:00:07.1: cannot adjust BAR2 (not I/O)
0000:00:07.1: cannot adjust BAR3 (not I/O)
PCI quirk: region 6000-607f claimed by vt82c686 HW-mon
PCI quirk: region 5000-500f claimed by vt82c686 SMB
PCI: Firmware left 0000:00:08.0 e100 interrupts enabled, disabling
PCI: Firmware left 0000:00:09.0 e100 interrupts enabled, disabling
PCI: Firmware left 0000:00:0a.0 e100 interrupts enabled, disabling
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 4 5 6 7 10 11 *12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 1 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 1 3 4 *5 6 7 10 11 12 14 15)
Linux Plug and Play Support v0.97 (c) Adam Belay
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
PCI: Bridge: 0000:00:01.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0d.0
IO window: disabled.
MEM window: e5300000-e53fffff
PREFETCH window: disabled.
PCI: Setting latency timer of device 0000:00:01.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
TCP established hash table entries: 8192 (order: 6, 262144 bytes)
TCP bind hash table entries: 4096 (order: 4, 114688 bytes)
TCP: Hash tables configured (established 8192 bind 4096)
TCP reno registered
io scheduler noop registered (default)
input: Power Button (FF) as /class/input/input0
ACPI: Power Button (FF) [PWRF]
input: Power Button (CM) as /class/input/input1
ACPI: Power Button (CM) [PWRB]
Real Time Clock Driver v1.12ac
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
serial8250: ttyS2 at I/O 0x3e8 (irq = 0) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[email protected]> and others
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 11
PCI: setting IRQ 11 as level-triggered
ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11
eth0: 0000:00:08.0, 00:30:64:05:69:CF, IRQ 11.
Board assembly 721383-016, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 12
PCI: setting IRQ 12 as level-triggered
ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LNKB] -> GSI 12 (level, low) -> IRQ 12
eth1: 0000:00:09.0, 00:30:64:05:69:D0, IRQ 12.
Board assembly 721383-016, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10
PCI: setting IRQ 10 as level-triggered
ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LNKC] -> GSI 10 (level, low) -> IRQ 10
eth2: 0000:00:0a.0, 00:30:64:05:69:D1, IRQ 10.
Board assembly 721383-016, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Probing IDE interface ide0...
hda: PQI IDE DiskOnModule, ATA DISK drive
Probing IDE interface ide1...
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: max request size: 128KiB
hda: 256000 sectors (131 MB) w/1KiB Cache, CHS=500/16/32
hda: hda1 hda2 hda3 hda4
PNP: No PS/2 controller found. Probing ports directly.
serio: i8042 KBD port at 0x60,0x64 irq 1
mice: PS/2 mouse device common for all mice
netem: version 1.2
Netfilter messages via NETLINK v0.30.
nf_conntrack version 0.5.0 (2047 buckets, 16376 max)
ctnetlink v0.93: registering with nfnetlink.
ip_tables: (C) 2000-2006 Netfilter Core Team
ClusterIP Version 0.8 loaded successfully
arp_tables: (C) 2002 David S. Miller
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
802.1Q VLAN Support v1.8 Ben Greear <[email protected]>
All bugs added by David S. Miller <[email protected]>
Using IPI Shortcut mode
VFS: Mounted root (ext2 filesystem) readonly.
Time: tsc clocksource has been installed.
Switched to high resolution mode on CPU 0
Freeing unused kernel memory: 168k freed
process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
ACPI: PCI Interrupt 0000:02:0e.0[A] -> Link [LNKC] -> GSI 10 (level, low) -> IRQ 10
Dta1xx: The DTA-105 (Firmware Version 0) has started successfully.
Dta1xx: Found 1 DTA-1xx boards in the system
Dta1xx: Major: 254
Dta1xx: The driver (Rev 1.11.2.105) has loaded successfully.
watchsoftware.s invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
[<c013fd7e>] out_of_memory+0xde/0x110
[<c0141102>] __alloc_pages+0x292/0x300
[<c013c461>] find_get_page+0x91/0x170
[<c0128520>] wake_bit_function+0x0/0x60
[<c0142e41>] __do_page_cache_readahead+0xa1/0x140
[<c013d749>] filemap_nopage+0x169/0x390
[<c01499ca>] do_no_page+0x6a/0x2a0
[<c0149fbf>] __handle_mm_fault+0x25f/0x2b0
[<c010e847>] do_page_fault+0x317/0x5c0
[<c010e530>] do_page_fault+0x0/0x5c0
[<c02b1394>] error_code+0x74/0x7c
[<c02b0033>] cond_resched_softirq_context+0x43/0xd0
=======================
Mem-info:
DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: Hot: hi: 90, btch: 15 usd: 53 Cold: hi: 30, btch: 7 usd: 9
Active:61612 inactive:137 dirty:0 writeback:0 unstable:0 free:744 slab:1336 mapped:82 pagetables:117
DMA free:1076kB min:124kB low:152kB high:184kB active:12244kB inactive:124kB present:16256kB pages_scanned:21628 all_unreclaimable? yes
lowmem_reserve[]: 0 238
Normal free:1900kB min:1908kB low:2384kB high:2860kB active:234204kB inactive:424kB present:243780kB pages_scanned:359621 all_unreclaimable? yes
lowmem_reserve[]: 0 0
DMA: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1076kB
Normal: 1*4kB 1*8kB 2*16kB 0*32kB 3*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1900kB
Free swap: 0kB
klogd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
[<c013fd7e>] out_of_memory+0xde/0x110
[<c0141102>] __alloc_pages+0x292/0x300
[<c013c461>] find_get_page+0x91/0x170
[<c0142e41>] __do_page_cache_readahead+0xa1/0x140
[<c013d749>] filemap_nopage+0x169/0x390
[<c01499ca>] do_no_page+0x6a/0x2a0
[<c0149fbf>] __handle_mm_fault+0x25f/0x2b0
[<c01284d0>] autoremove_wake_function+0x0/0x50
[<c010e847>] do_page_fault+0x317/0x5c0
[<c01577a1>] sys_read+0x41/0x70
[<c010e530>] do_page_fault+0x0/0x5c0
[<c02b1394>] error_code+0x74/0x7c
=======================
Mem-info:
DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: Hot: hi: 90, btch: 15 usd: 53 Cold: hi: 30, btch: 7 usd: 9
Active:61612 inactive:137 dirty:0 writeback:0 unstable:0 free:744 slab:1336 mapped:82 pagetables:117
DMA free:1076kB min:124kB low:152kB high:184kB active:12244kB inactive:124kB present:16256kB pages_scanned:21628 all_unreclaimable? yes
lowmem_reserve[]: 0 238
Normal free:1900kB min:1908kB low:2384kB high:2860kB active:234204kB inactive:424kB present:243780kB pages_scanned:359621 all_unreclaimable? yes
lowmem_reserve[]: 0 0
DMA: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1076kB
Normal: 1*4kB 1*8kB 2*16kB 0*32kB 3*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1900kB
Free swap: 0kB
tail invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
[<c013fd7e>] out_of_memory+0xde/0x110
[<c0141102>] __alloc_pages+0x292/0x300
[<c013c461>] find_get_page+0x91/0x170
[<c0128520>] wake_bit_function+0x0/0x60
[<c0142e41>] __do_page_cache_readahead+0xa1/0x140
[<c013d749>] filemap_nopage+0x169/0x390
[<c01499ca>] do_no_page+0x6a/0x2a0
[<c0149fbf>] __handle_mm_fault+0x25f/0x2b0
[<c012b883>] hrtimer_cancel+0x13/0x20
[<c010e847>] do_page_fault+0x317/0x5c0
[<c010e530>] do_page_fault+0x0/0x5c0
[<c02b1394>] error_code+0x74/0x7c
[<c02b0033>] cond_resched_softirq_context+0x43/0xd0
=======================
Mem-info:
DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: Hot: hi: 90, btch: 15 usd: 53 Cold: hi: 30, btch: 7 usd: 9
Active:61612 inactive:137 dirty:0 writeback:0 unstable:0 free:744 slab:1336 mapped:82 pagetables:117
DMA free:1076kB min:124kB low:152kB high:184kB active:12244kB inactive:124kB present:16256kB pages_scanned:21628 all_unreclaimable? yes
lowmem_reserve[]: 0 238
Normal free:1900kB min:1908kB low:2384kB high:2860kB active:234204kB inactive:424kB present:243780kB pages_scanned:359666 all_unreclaimable? yes
lowmem_reserve[]: 0 0
DMA: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1076kB
Normal: 1*4kB 1*8kB 2*16kB 0*32kB 3*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1900kB
Free swap: 0kB
65520 pages of RAM
0 pages of HIGHMEM
1283 reserved pages
855 pages shared
0 pages swap cached
0 pages dirty
0 pages writeback
82 pages mapped
1336 pages slab
117 pages pagetables
Out of memory: kill process 1023 (ControlTask) score 149 or a child
Killed process 1141 (myapp)
65520 pages of RAM
0 pages of HIGHMEM
1283 reserved pages
793 pages shared
0 pages swap cached
0 pages dirty
0 pages writeback
79 pages mapped
1259 pages slab
61 pages pagetables
Out of memory: kill process 853 (syslogd) score 134 or a child
Killed process 853 (syslogd)
65520 pages of RAM
0 pages of HIGHMEM
1283 reserved pages
818 pages shared
0 pages swap cached
0 pages dirty
0 pages writeback
85 pages mapped
1259 pages slab
61 pages pagetables
Out of memory: kill process 1021 (boa) score 130 or a child
Killed process 1021 (boa)
root@RCV:/var/log#


Attachments:
dmesg.txt (14.14 kB)

2007-06-13 09:08:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Runaway process and oom-killer

On Wed, Jun 13, 2007 at 10:49:29AM +0200, John Sigler wrote:
> Question 1: why isn't this output picked up by klogd and sent to the
> appropriate file (kern.log in my case)?

I guess not a question for lkml, most distros uses syslog-ng anyway.

> Question 2: how can I tell which process or kernel thread was hogging
> most of the RAM when the oom-killer kicked in?

Theoretically the one that was killed first but not for sure in
current mainline hence see below.

> Question 3: if myapp was hogging the RAM, then why did oom-killer snipe
> other processes (syslogd and boa) after it killed myapp?

Please try again after applying this monolith (proper submission was
on linux-mm but the monolith is easier to apply for tests like this):

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.22-rc4/oom-fixes-1

2007-06-13 09:20:59

by John Sigler

[permalink] [raw]
Subject: Re: Runaway process and oom-killer

Andrea Arcangeli wrote:

> On Wed, Jun 13, 2007 at 10:49:29AM +0200, John Sigler wrote:
>
>> Question 2: how can I tell which process or kernel thread was hogging
>> most of the RAM when the oom-killer kicked in?
>
> Theoretically the one that was killed first but not for sure in
> current mainline hence see below.

If I read the logs correctly, oom-killer is "invoked" three times before
it effectively kills a process. Then oom-killer kills myapp, syslogd,
and boa, in that order. Why didn't oom-killer kill anything the first
three times?

>> Question 3: if myapp was hogging the RAM, then why did oom-killer snipe
>> other processes (syslogd and boa) after it killed myapp?
>
> Please try again after applying this monolith (proper submission was
> on linux-mm but the monolith is easier to apply for tests like this):
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.22-rc4/oom-fixes-1

Would this patch apply to a 2.6.20.7-rt8 source tree?

Regards.

2007-06-13 13:42:29

by Helge Hafting

[permalink] [raw]
Subject: Re: Runaway process and oom-killer

John Sigler wrote:
> Andrea Arcangeli wrote:
>
>> On Wed, Jun 13, 2007 at 10:49:29AM +0200, John Sigler wrote:
>>
>>> Question 2: how can I tell which process or kernel thread was
>>> hogging most of the RAM when the oom-killer kicked in?
>>
>> Theoretically the one that was killed first but not for sure in
>> current mainline hence see below.
>
> If I read the logs correctly, oom-killer is "invoked" three times
> before it effectively kills a process. Then oom-killer kills myapp,
> syslogd, and boa, in that order. Why didn't oom-killer kill anything
> the first three times?
My guess:
Something needs memory but finds there is none to be had
oom-killer is invoked and targets myapp.
myapp takes some time to die. Particularly, the memory it uses
isn't freed up instantly. In the meantime something else
needs memory and find none. (Another packet received?)
The oom-killer is invoked again, this time it targets syslogd.

And so on. The kernel do many things in parallel, running out
of memory in a multitasking system therefore is a complicated business.
Especially when process killing takes some time.

Note that you can turn off memory overcommit, your leaky app
should then get a memory allocation error instead of
triggering the oom-killer.

Helge Hafting

2007-06-13 15:11:04

by Chris Friesen

[permalink] [raw]
Subject: Re: Runaway process and oom-killer

Helge Hafting wrote:

> My guess:
> Something needs memory but finds there is none to be had
> oom-killer is invoked and targets myapp.
> myapp takes some time to die. Particularly, the memory it uses
> isn't freed up instantly.

Has anyone considered actually bumping up the priority of the task being
killed so that it gets to run and free up its resources in a timely manner?

We've done some experimenting with actually putting it in SCHED_RR and
it seems to help (in the case of other busy SCHED_RR tasks on the
system). Admittedly we have an older kernel, so behaviour may be
different now.

Chris

2007-06-13 15:26:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Runaway process and oom-killer

On Wed, Jun 13, 2007 at 09:10:33AM -0600, Chris Friesen wrote:
> Helge Hafting wrote:
>
> >My guess:
> >Something needs memory but finds there is none to be had
> >oom-killer is invoked and targets myapp.
> >myapp takes some time to die. Particularly, the memory it uses
> >isn't freed up instantly.
>
> Has anyone considered actually bumping up the priority of the task being
> killed so that it gets to run and free up its resources in a timely manner?
>
> We've done some experimenting with actually putting it in SCHED_RR and
> it seems to help (in the case of other busy SCHED_RR tasks on the
> system). Admittedly we have an older kernel, so behaviour may be
> different now.

Replacing the timeslice boost with sched-rr would probably be an ok
cleanup, but I'm unsure if it risks to break the other RT tasks
(especially when folks misuse a soft-RT design to do hard-RT work,
with userland drivers running with RT privilege mapping iommu
regions).

Most cpus goes idle as soon as the oom killer is entered so it
probably doesn't make much difference anyway. I didn't remove the
timeslice hack just because there wasn't any compelling reason to
remove it but it's probably not making much difference.

2007-06-14 09:43:21

by John Sigler

[permalink] [raw]
Subject: Re: Runaway process and oom-killer

Helge Hafting wrote:

> John Sigler wrote:
>
>> Andrea Arcangeli wrote:
>>
>>> On Wed, Jun 13, 2007 at 10:49:29AM +0200, John Sigler wrote:
>>>
>>>> Question 2: how can I tell which process or kernel thread was
>>>> hogging most of the RAM when the oom-killer kicked in?
>>>
>>> Theoretically the one that was killed first but not for sure in
>>> current mainline hence see below.
>>
>> If I read the logs correctly, oom-killer is "invoked" three times
>> before it effectively kills a process. Then oom-killer kills myapp,
>> syslogd, and boa, in that order. Why didn't oom-killer kill anything
>> the first three times?
>
> My guess:
> Something needs memory but finds there is none to be had
> oom-killer is invoked and targets myapp.
> myapp takes some time to die. Particularly, the memory it uses
> isn't freed up instantly. In the meantime something else
> needs memory and find none. (Another packet received?)

Possibly. In fact, myapp receives a 40 Mbit/s stream.

> The oom-killer is invoked again, this time it targets syslogd.

I went hunting, and found a memory leak in our syslogd. Doh!

> And so on. The kernel do many things in parallel, running out
> of memory in a multitasking system therefore is a complicated business.
> Especially when process killing takes some time.

I didn't mention that there is no swap on this system.

> Note that you can turn off memory overcommit, your leaky app
> should then get a memory allocation error instead of
> triggering the oom-killer.

Are you referring to these /proc/sys/vm entries?

# cat /proc/sys/vm/overcommit_memory
0
# cat /proc/sys/vm/overcommit_ratio
50

Are you suggesting I set overcommit_memory to 2?

The manual states:

/proc/sys/vm/overcommit_memory

This file contains the kernel virtual memory accounting mode.
Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and
the default check is very weak, leading to the risk of getting a process
"OOM-killed". Under Linux 2.4 any non-zero value implies mode 1. In mode
2 (available since Linux 2.6), the total virtual address space on the
system is limited to (SS + RAM*(r/100)), where SS is the size of the
swap space, and RAM is the size of the physical memory, and r is the
contents of the file /proc/sys/vm/overcommit_ratio.

In my case, SS=0 and RAM=256MB.

If I understand correctly, if I set ratio to 50, then processes can only
address 128MB. I'd be, in effect, reserving 128MB for the kernel, right?

Are there other entries in /proc/sys/vm I should be playing with? :-)

/proc/sys/vm/block_dump
0
/proc/sys/vm/dirty_background_ratio
10
/proc/sys/vm/dirty_expire_centisecs
3000
/proc/sys/vm/dirty_ratio
40
/proc/sys/vm/dirty_writeback_centisecs
500
/proc/sys/vm/drop_caches
0
/proc/sys/vm/laptop_mode
0
/proc/sys/vm/legacy_va_layout
0
/proc/sys/vm/lowmem_reserve_ratio
256
/proc/sys/vm/max_map_count
65536
/proc/sys/vm/min_free_kbytes
2039
/proc/sys/vm/nr_pdflush_threads
2
/proc/sys/vm/overcommit_memory
0
/proc/sys/vm/overcommit_ratio
50
/proc/sys/vm/page-cluster
3
/proc/sys/vm/panic_on_oom
0
/proc/sys/vm/percpu_pagelist_fraction
0
/proc/sys/vm/swappiness
60
/proc/sys/vm/vdso_enabled
1
/proc/sys/vm/vfs_cache_pressure
100

Regards.

2007-06-14 10:03:46

by John Sigler

[permalink] [raw]
Subject: Re: Runaway process and oom-killer

Chris Friesen wrote:

> Helge Hafting wrote:
>
>> My guess:
>> Something needs memory but finds there is none to be had
>> oom-killer is invoked and targets myapp.
>> myapp takes some time to die. Particularly, the memory it uses
>> isn't freed up instantly.
>
> Has anyone considered actually bumping up the priority of the task being
> killed so that it gets to run and free up its resources in a timely manner?

On this system,
myapp runs in SCHED_RR with priority 80.
IRQ handlers run in SCHED_FIFO with priority 50.

# ps -eo comm,class,rtprio,ni,pri --sort -rtprio
COMMAND CLS RTPRIO NI PRI
posix_cpu_timer FF 99 - 139
myapp RR 80 - 120
softirq-high/0 FF 50 - 90
softirq-timer/0 FF 50 - 90
softirq-net-tx/ FF 50 - 90
softirq-net-rx/ FF 50 - 90
softirq-block/0 FF 50 - 90
softirq-tasklet FF 50 - 90
softirq-sched/0 FF 50 - 90
softirq-hrtimer FF 50 - 90
softirq-rcu/0 FF 50 - 90
IRQ-7 FF 50 - 90
IRQ-8 FF 50 - 90
IRQ-14 FF 50 - 90
IRQ-12 FF 50 - 90
IRQ-1 FF 50 - 90
IRQ-10 FF 50 - 90
IRQ-11 FF 50 - 90
IRQ-5 FF 50 - 90
IRQ-3 FF 50 - 90
IRQ-4 FF 50 - 90
events/0 FF 1 - 41
init TS - 0 24
desched/0 TS - -10 34
khelper TS - -5 29
kthread TS - -5 27
kblockd/0 TS - -5 21
kacpid TS - -5 19
kseriod TS - -5 29
pdflush TS - 0 17
pdflush TS - 0 24
kswapd0 TS - -5 23
flush_filesd/0 TS - -5 29
aio/0 TS - -5 22
syslogd TS - 0 21
klogd TS - 0 21
sshd TS - 0 21
acpid TS - 0 16
agetty TS - 0 24
agetty TS - 0 21
agetty TS - 0 21
agetty TS - 0 21
[...]

How do the scheduling class and priority of the process come into play
when the kernel comes to reclaim memory after the oom-killer has decided
to snipe that particular process?

> We've done some experimenting with actually putting it in SCHED_RR and
> it seems to help (in the case of other busy SCHED_RR tasks on the
> system). Admittedly we have an older kernel, so behaviour may be
> different now.

Thanks for sharing your experience.

Regards.

2007-06-14 13:09:38

by Helge Hafting

[permalink] [raw]
Subject: Re: Runaway process and oom-killer

John Sigler wrote:
>>
>> My guess:
>> Something needs memory but finds there is none to be had
>> oom-killer is invoked and targets myapp.
>> myapp takes some time to die. Particularly, the memory it uses
>> isn't freed up instantly. In the meantime something else
>> needs memory and find none. (Another packet received?)
>
> Possibly. In fact, myapp receives a 40 Mbit/s stream.
>
>> The oom-killer is invoked again, this time it targets syslogd.
>
> I went hunting, and found a memory leak in our syslogd. Doh!
Good for you! :-) If this is standard syslogd, please send a patch
to the maintainer.
>
>> And so on. The kernel do many things in parallel, running out
>> of memory in a multitasking system therefore is a complicated business.
>> Especially when process killing takes some time.
>
> I didn't mention that there is no swap on this system.
It doesn't matter that much. With swap, you get a slowdown
as memory is used up, and eventually oom-kill when swap too runs out.
With swap, you run at full speed until you hit the wall.

Swapping is nice on a desktop machine, because:
* Light swapping is not a problem, so you can do work that
would otherwise be impossible
* Heavy swapping is noticeable, so you close some apps in a
civilized manner, instead of having the oom-killer
surprising you.

This may or may not apply to your RT setup.

>> Note that you can turn off memory overcommit, your leaky app
>> should then get a memory allocation error instead of
>> triggering the oom-killer.
>
> Are you referring to these /proc/sys/vm entries?
>
> # cat /proc/sys/vm/overcommit_memory
> 0
> # cat /proc/sys/vm/overcommit_ratio
> 50
>
> Are you suggesting I set overcommit_memory to 2?
Yes. You can never trust the oom-killer to do the "right thing"
in all circumstances - as you discovered. It can probably be improved on,
but there will be corner cases anyway.
When you use overcommit, the oom-killer is the alternative to having
the machine halt when it can't keep the promise of more memory.
> The manual states:
>
> /proc/sys/vm/overcommit_memory
>
> This file contains the kernel virtual memory accounting mode.
> Values are:
> 0: heuristic overcommit (this is the default)
> 1: always overcommit, never check
> 2: always check, never overcommit
> In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked,
> and the default check is very weak, leading to the risk of getting a
> process "OOM-killed". Under Linux 2.4 any non-zero value implies mode
> 1. In mode 2 (available since Linux 2.6), the total virtual address
> space on the system is limited to (SS + RAM*(r/100)), where SS is the
> size of the swap space, and RAM is the size of the physical memory,
> and r is the contents of the file /proc/sys/vm/overcommit_ratio.
>
> In my case, SS=0 and RAM=256MB.
>
> If I understand correctly, if I set ratio to 50, then processes can
> only address 128MB. I'd be, in effect, reserving 128MB for the kernel,
> right?
>
I think so. You might want to modify "r" depending on how much your kernel
actually uses.

Note that not overcommitting tends to run out of memory very fast,
because of the way fork() is used. When a program forks,
it becomes two processes that initially shares all memory.
That means they don't use more memory after the fork().

But the memory will be spent as soon as one process changes
its memory, for the change is not supposed to be seen in
the other process. Without overcommit then, the kernel will
allow the fork() only if there is enough memory left to
copy the entire process being forked. This is often
unnecessary in practice - if you have several shells running then
they share lots of memory that they never actually will need
to duplicate. Overcommit is therefore popular, it lets
you run much more software as long as it behaves this way.

Another solution: Set memory_overcommit to 2. And get some swap.
The swap will never actually be _used_, unless you get so many
allocations that the oom-killer would have stepped in without swap.

You can then monitor swapping. If anything is written to swap,
then you're using too much memory and should investigate.
For production, this is usually better than cleaning up after
the oom-killer busted some random processes.


Helge Hafting