2009-10-12 12:02:35

by Vedran Furač

Subject: Memory overcommit

Hi! I don't know if this is the appropriate place to ask such questions;
if not, please point me to one.

Let's simulate a process gone berserk with this piece of code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char *buf;

	while (1) {
		/* keep allocating 100 MB chunks until malloc fails */
		buf = malloc(1024 * 1024 * 100);
		if (buf == NULL) {
			perror("malloc");
			getchar();
			exit(EXIT_FAILURE);
		}
		sleep(1);
		/* touch every page so the memory is actually used */
		memset(buf, 1, 1024 * 1024 * 100);
	}
	return 0;
}

# echo 0 > /proc/sys/vm/overcommit_memory

Compile, run, and soon the result is:
- System freezes for a second or two
- OOMK wakes up
- X crashes

Now, I'm back on VT1 and dmesg shows 8 processes were killed by the OOMK
(including the X server and some long-running daemons with a small
memory footprint, like automount) before the real culprit was killed.
This random killing spree *really* gives Linux a bad reputation, and
people usually point it out as an argument against it.

But, there is an easy fix:
# echo 2 > /proc/sys/vm/overcommit_memory
Run the program again and after a few seconds you'll get:

"malloc: Cannot allocate memory"

and that's all that happens. Nothing gets killed, and one (and others
too) can continue to work without losing time, data, or both. The only
somewhat strange thing is that the kernel contradicts itself, saying
there is no more memory while at the same time reporting:

/proc/meminfo
MemTotal: 3542532 kB
MemFree: 892972 kB
Buffers: 2664 kB
Cached: 130940 kB

...that there is almost 900MB free memory. But OK, I can live with it.
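
Presumably the contradiction is only apparent: with overcommit_memory=2
the kernel compares committed address space (Committed_AS) against
CommitLimit, not against MemFree. A minimal sketch of that accounting,
using the MemTotal above and assuming no swap and the default
vm.overcommit_ratio of 50:

```shell
# CommitLimit = Swap + RAM * overcommit_ratio / 100
ram_kb=3542532          # MemTotal from the meminfo above
swap_kb=0               # assumption: no swap configured
ratio=50                # kernel default for vm.overcommit_ratio
commit_limit_kb=$(( swap_kb + ram_kb * ratio / 100 ))
echo "CommitLimit: ${commit_limit_kb} kB"
# the kernel's own numbers are visible in /proc/meminfo:
grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo
```

So with no swap and the default ratio, only about half of RAM can ever
be committed in mode 2, which is why an allocation can fail while
MemFree still shows ~900 MB.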

So, my question is: why isn't overcommit turned off *by default* today?
I have had it turned off for a few years now and the only side effect is
that I don't get processes killed randomly anymore and I don't lose
valuable time and data.

Regards,

Vedran


2009-10-13 03:11:49

by Kamezawa Hiroyuki

Subject: Re: Memory overcommit

On Mon, 12 Oct 2009 13:51:07 +0200
Vedran Furač <[email protected]> wrote:
> /proc/meminfo
> MemTotal: 3542532 kB
> MemFree: 892972 kB
> Buffers: 2664 kB
> Cached: 130940 kB
>
> ...that there is almost 900MB free memory. But OK, I can live with it.
>
> So, my question is: why isn't overcommit turned off *by default* today?
> I have had it turned off for a few years now and the only side effect is
> that I don't get processes killed randomly anymore and I don't lose
> valuable time and data.
>
"isn't turned off" means "vm.overcommit_memory==2"?
And... what kernel version are you running?
Does the oom-killer still find "definitely-not-guilty" ones?


I guess the reason for the default value is that the kernel assumes
processes will not always use their entire mmapped range. There will be
unused ranges in a process's virtual memory, and they can be big.

For example, a typical case on a server: when you run a multi-threaded
program (like a Java VM),

- stack per thread
- malloc() arena per thread

can make the difference between size-of-mapped-range and used-pages
bigger. I saw gigabytes of unused range on an ia64 host... the stack
size was big.

IIUC, the size is determined by ulimit's stack size by default; it's 10M
on my x86-64 host. You'll see 1G of committed usage when you run 100
no-op threads.
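
The arithmetic behind that 1G estimate can be sketched as follows (the
per-thread stack size and thread count are the figures above; the
`unlimited` fallback is an assumption for odd environments):

```shell
# Each thread's stack reserves the full `ulimit -s` worth of address
# space up front, even though an idle thread touches almost none of it.
stack_kb=$(ulimit -s)
case "$stack_kb" in
    unlimited|'') stack_kb=10240 ;;   # assume the 10M default
esac
threads=100
echo "committed by $threads idle threads: $(( threads * stack_kb / 1024 )) MB"
```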

And if the strict check (vm.overcommit_memory=2) is used, mmap() returns
-ENOMEM whenever it hits the limit.
You have to find out which process should be killed by yourself, anyway.


Against random kills, you have 2 choices:

1. use /proc/<pid>/oom_adj
2. use memory cgroup.

Some easier-to-use method would be appreciated. We have the above 2 now.
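
As a sketch of option 1 (assuming the 2.6-era /proc/<pid>/oom_adj
interface, range -17..15): raising the value marks a process as a
preferred OOM victim and needs no privileges, while lowering it to
protect a process (e.g. -17 disables killing entirely) requires root.

```shell
# Mark the suspected memory hog as the preferred OOM victim.
pid=$$                         # hypothetical target: this shell
echo 10 > /proc/$pid/oom_adj   # bias the badness score upward
cat /proc/$pid/oom_score       # the value the OOM killer ranks by
```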

Thanks,
-Kame

2009-10-13 17:14:43

by Vedran Furač

Subject: Re: Memory overcommit

KAMEZAWA Hiroyuki wrote:

> On Mon, 12 Oct 2009 13:51:07 +0200 Vedran Furač
> <[email protected]> wrote:
>> /proc/meminfo
>> MemTotal: 3542532 kB
>> MemFree: 892972 kB
>> Buffers: 2664 kB
>> Cached: 130940 kB
>>
>> ...that there is almost 900MB free memory. But OK, I can live with
>> it.
>>
>> So, my question is: why isn't overcommit turned off *by default*
>> today? I have had it turned off for a few years now and the only side
>> effect is that I don't get processes killed randomly anymore and I
>> don't lose valuable time and data.
>>
> "isn't turned off" means "vm.overcommit_memory==2" ?

Yes, "2: always check, never overcommit" as per proc(5)

> And... what kernel version are you running?

Applies to every 2.6.

> Does the oom-killer still find "definitely-not-guilty" ones?

Yes. It's always repeatable. Just compile and run that code. I'll
probably just file a bug report.

> I guess the reason for the default value is that the kernel assumes
> processes will not always use their entire mmapped range. There will be
> unused ranges in a process's virtual memory, and they can be big.
>
> For example, a typical case on a server: when you run a multi-threaded
> program (like a Java VM),
>
> - stack per thread
> - malloc() arena per thread
>
> can make the difference between size-of-mapped-range and used-pages
> bigger. I saw gigabytes of unused range on an ia64 host... the stack
> size was big.

Yes, I noticed that the JVM allocates gigabytes but then uses less than
10% of that and, as a consequence, Eclipse sometimes fails to start
although there's plenty of free memory. So overcommitting is a kind of
workaround for broken software that allocates not what it needs but what
it might need on some rare occasion. I would rather fix this userland
software than risk OOM situations and the random killing of innocent
processes.

> And if the strict check (vm.overcommit_memory=2) is used, mmap()
> returns -ENOMEM whenever it hits the limit.

% strace -f -e mmap java -version
[...]
mmap(NULL, 996147200, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot
allocate memory)

And that should be fine.

> Against random kills, you have 2 choices:
>
> 1. use /proc/<pid>/oom_adj
> 2. use memory cgroup.
>
> Some easier-to-use method would be appreciated. We have the above 2
> now.

These are just bad workarounds for a bad OOM algorithm. I tested this
little program on multiple systems (including Windows) without any
tweaking, and Linux's behavior is, unfortunately, *the worst*. :/


Regards,

Vedran

2009-10-14 04:54:25

by Kamezawa Hiroyuki

Subject: Re: Memory overcommit

On Tue, 13 Oct 2009 19:13:34 +0200
Vedran Furač <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > I guess the reason for the default value is that the kernel assumes
> > processes will not always use their entire mmapped range. There will
> > be unused ranges in a process's virtual memory, and they can be big.
> >
> > For example, a typical case on a server: when you run a multi-threaded
> > program (like a Java VM),
> >
> > - stack per thread
> > - malloc() arena per thread
> >
> > can make the difference between size-of-mapped-range and used-pages
> > bigger. I saw gigabytes of unused range on an ia64 host... the stack
> > size was big.
>
> Yes, I noticed that the JVM allocates gigabytes but then uses less than
> 10% of that and, as a consequence, Eclipse sometimes fails to start
> although there's plenty of free memory. So overcommitting is a kind of
> workaround for broken software that allocates not what it needs but
> what it might need on some rare occasion. I would rather fix this
> userland software than risk OOM situations and the random killing of
> innocent processes.
>

In my understanding, mmap() these days is just for requesting virtual
address space, not for requesting memory.



> > And if the strict check (vm.overcommit_memory=2) is used, mmap()
> > returns -ENOMEM whenever it hits the limit.
>
> % strace -f -e mmap java -version
> [...]
> mmap(NULL, 996147200, PROT_READ|PROT_WRITE|PROT_EXEC,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot
> allocate memory)
>
> And that should be fine.
>
It's not fine for me ;)

> > Against random kills, you have 2 choices:
> >
> > 1. use /proc/<pid>/oom_adj
> > 2. use memory cgroup.
> >
> > Some easier-to-use method would be appreciated. We have the above 2
> > now.
>
> These are just bad workarounds for a bad OOM algorithm. I tested this
> little program on multiple systems (including Windows) without any
> tweaking, and Linux's behavior is, unfortunately, *the worst*. :/
>
>
Yes, they are workarounds. You can use /etc/sysctl.conf.
But if we made it the default _now_, many threaded programs would not work.

But I agree, the OOM killer should be more sophisticated.
Please give us a sample program/test case which causes the problem.
[email protected] may be a better place; lkml has too much traffic.
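
For the sysctl.conf route mentioned above, the persistent form of the
strict setting would be a fragment like this (overcommit_ratio shown at
its default as an assumption):

```
# /etc/sysctl.conf -- strict overcommit accounting
vm.overcommit_memory = 2
# portion of RAM counted toward CommitLimit (default 50)
vm.overcommit_ratio = 50
```

It is applied at boot, or immediately with `sysctl -p`.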

Regards,
-Kame

2009-10-26 16:16:14

by Vedran Furač

Subject: Re: Memory overcommit

KAMEZAWA Hiroyuki wrote:

> Can I ask a few more questions?

Sure

> - What's cpu ?

vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: AMD Phenom(tm) II X3 720 Processor
stepping	: 2
cpu MHz		: 3314.812
cache size	: 512 KB

> - How much memory ?
> - Do you have swap ?

total used free shared buffers cached
Mem: 3459 1452 2007 0 65 622
-/+ buffers/cache: 764 2695
Swap: 0 0 0

So, no swap. Don't need it.

> - What's the latest kernel version you tested?

2.6.30-2-amd64 #1 SMP (on Debian)

> - Could you show me /var/log/dmesg and /var/log/messages at OOM ?

It was a catastrophe. :) X crashed (or was killed) with all the
programs, but my little program was alive for 20 minutes (see
timestamps). And for that time the computer was completely unusable. I
couldn't even get a console via ssh. Really embarrassing for a modern OS
to get destroyed by five lines of C run as an ordinary user. Luckily
screen was still alive; the OOM killer usually kills it too. See for
yourself:

dmesg: http://pastebin.com/f3f83738a
messages: http://pastebin.com/f2091110a

(CCing lkml again... I just want people to see the logs.)

Regards,

Vedran

2009-10-27 03:24:52

by Kamezawa Hiroyuki

Subject: Re: Memory overcommit

On Mon, 26 Oct 2009 17:16:14 +0100
Vedran Furač <[email protected]> wrote:
> > - Could you show me /var/log/dmesg and /var/log/messages at OOM ?
>
> It was a catastrophe. :) X crashed (or was killed) with all the
> programs, but my little program was alive for 20 minutes (see
> timestamps). And for that time the computer was completely unusable. I
> couldn't even get a console via ssh. Really embarrassing for a modern
> OS to get destroyed by five lines of C run as an ordinary user. Luckily
> screen was still alive; the OOM killer usually kills it too. See for
> yourself:
>
> dmesg: http://pastebin.com/f3f83738a
> messages: http://pastebin.com/f2091110a
>
> (CCing lkml again... I just want people to see the logs.)
>
Thank you for reporting and your patience. It does seem strange that
your KDE programs are killed. I agree.

I attached a script for checking the oom_score of all existing
processes. (oom_score is the value used for selecting a "bad" process.)
Please run it if you have time.
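
The attachment isn't reproduced here; as a sketch, a plain-shell
equivalent of such a script (print oom_score, pid, and name for every
process, worst candidates last) might look like:

```shell
# List processes by oom_score, highest (most likely victim) last.
for d in /proc/[0-9]*; do
    pid=${d#/proc/}
    score=$(cat "$d/oom_score" 2>/dev/null) || continue
    name=$(cat "$d/comm" 2>/dev/null)
    printf '%s\t%s\t%s\n' "$score" "$pid" "$name"
done | sort -n | tail
```

/proc/<pid>/comm is assumed available (2.6.33+); on older kernels the
name would come from /proc/<pid>/status instead.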

This is a result from my own desktop (on a virtual machine).
In this environment (total memory is 1.6 GB), an mmap(1G) program is running.

%check_badness.pl | sort -n | tail
--
89924 3938 mixer_applet2
90210 3942 tomboy
94753 3936 clock-applet
101994 3919 pulseaudio
113525 4028 gnome-terminal
127340 1 init
128177 3871 nautilus
151003 11515 bash
256944 11653 mmap
425561 3829 gnome-session
--
Sigh, gnome-session has twice the value of mmap(1G).
Of course, gnome-session only uses 6 MB of anon.
I wonder if this is because gnome-session has many children... but I
need to dig more. Does anyone have an idea?
(CCed kosaki)

Thanks,
-Kame





Attachments:
check_badness.pl (313.00 B)

2009-10-27 06:10:52

by KOSAKI Motohiro

Subject: Re: Memory overcommit

2009/10/27 KAMEZAWA Hiroyuki <[email protected]>:
> On Mon, 26 Oct 2009 17:16:14 +0100
> Vedran Furač <[email protected]> wrote:
>> >  - Could you show me /var/log/dmesg and /var/log/messages at OOM ?
>>
> >> It was a catastrophe. :) X crashed (or was killed) with all the
> >> programs, but my little program was alive for 20 minutes (see
> >> timestamps). And for that time the computer was completely unusable.
> >> I couldn't even get a console via ssh. Really embarrassing for a
> >> modern OS to get destroyed by five lines of C run as an ordinary
> >> user. Luckily screen was still alive; the OOM killer usually kills
> >> it too. See for yourself:
>>
>> dmesg: http://pastebin.com/f3f83738a
>> messages: http://pastebin.com/f2091110a
>>
> >> (CCing lkml again... I just want people to see the logs.)
>>
> Thank you for reporting and your patience. It does seem strange that
> your KDE programs are killed. I agree.
>
> I attached a script for checking the oom_score of all existing
> processes. (oom_score is the value used for selecting a "bad" process.)
> Please run it if you have time.
>
> This is a result from my own desktop (on a virtual machine).
> In this environment (total memory is 1.6 GB), an mmap(1G) program is running.
>
> %check_badness.pl | sort -n | tail
> --
> 89924   3938    mixer_applet2
> 90210   3942    tomboy
> 94753   3936    clock-applet
> 101994  3919    pulseaudio
> 113525  4028    gnome-terminal
> 127340  1       init
> 128177  3871    nautilus
> 151003  11515   bash
> 256944  11653   mmap
> 425561  3829    gnome-session
> --
> Sigh, gnome-session has twice the value of mmap(1G).
> Of course, gnome-session only uses 6 MB of anon.
> I wonder if this is because gnome-session has many children... but I
> need to dig more. Does anyone have an idea?
> (CCed kosaki)

The following output addresses the issue. The fact is, modern desktop
applications link against a great many libraries, which bloats the VSS
size and increases the OOM score.

Ideally, we shouldn't account evictable file-backed mappings toward
oom_score.
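
A rough way to see the split being described is to compare a process's
resident pages against its anonymous pages, as a sketch against
/proc/<pid>/smaps (the Anonymous: field is assumed present, as on
2.6.30+ kernels):

```shell
# Sum resident vs anonymous pages across all mappings; the gap is
# largely evictable file-backed memory (mapped libraries), which
# arguably should not count toward oom_score.
pid=$$   # hypothetical target: this shell
awk '/^Rss:/       { rss  += $2 }
     /^Anonymous:/ { anon += $2 }
     END { printf "rss=%d kB, anon=%d kB, file-backed~=%d kB\n",
                  rss, anon, rss - anon }' "/proc/$pid/smaps"
```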


# cat /proc/`pidof gnome-session`/maps
00400000-00433000 r-xp 00000000 fd:00 100061
/usr/bin/gnome-session
00632000-00637000 rw-p 00032000 fd:00 100061
/usr/bin/gnome-session
00949000-00a10000 rw-p 00000000 00:00 0 [heap]
34cf600000-34cf61f000 r-xp 00000000 fd:00 1088
/lib64/ld-2.10.1.so
34cf81e000-34cf81f000 r--p 0001e000 fd:00 1088
/lib64/ld-2.10.1.so
34cf81f000-34cf820000 rw-p 0001f000 fd:00 1088
/lib64/ld-2.10.1.so
34cfa00000-34cfb64000 r-xp 00000000 fd:00 1089
/lib64/libc-2.10.1.so
34cfb64000-34cfd64000 ---p 00164000 fd:00 1089
/lib64/libc-2.10.1.so
34cfd64000-34cfd68000 r--p 00164000 fd:00 1089
/lib64/libc-2.10.1.so
34cfd68000-34cfd69000 rw-p 00168000 fd:00 1089
/lib64/libc-2.10.1.so
34cfd69000-34cfd6e000 rw-p 00000000 00:00 0
34cfe00000-34cfe82000 r-xp 00000000 fd:00 1104
/lib64/libm-2.10.1.so
34cfe82000-34d0082000 ---p 00082000 fd:00 1104
/lib64/libm-2.10.1.so
34d0082000-34d0083000 r--p 00082000 fd:00 1104
/lib64/libm-2.10.1.so
34d0083000-34d0084000 rw-p 00083000 fd:00 1104
/lib64/libm-2.10.1.so
34d0200000-34d0202000 r-xp 00000000 fd:00 1095
/lib64/libdl-2.10.1.so
34d0202000-34d0402000 ---p 00002000 fd:00 1095
/lib64/libdl-2.10.1.so
34d0402000-34d0403000 r--p 00002000 fd:00 1095
/lib64/libdl-2.10.1.so
34d0403000-34d0404000 rw-p 00003000 fd:00 1095
/lib64/libdl-2.10.1.so
34d0600000-34d0617000 r-xp 00000000 fd:00 1090
/lib64/libpthread-2.10.1.so
34d0617000-34d0816000 ---p 00017000 fd:00 1090
/lib64/libpthread-2.10.1.so
34d0816000-34d0817000 r--p 00016000 fd:00 1090
/lib64/libpthread-2.10.1.so
34d0817000-34d0818000 rw-p 00017000 fd:00 1090
/lib64/libpthread-2.10.1.so
34d0818000-34d081c000 rw-p 00000000 00:00 0
34d0a00000-34d0a15000 r-xp 00000000 fd:00 1113
/lib64/libz.so.1.2.3
34d0a15000-34d0c14000 ---p 00015000 fd:00 1113
/lib64/libz.so.1.2.3
34d0c14000-34d0c15000 rw-p 00014000 fd:00 1113
/lib64/libz.so.1.2.3
34d0e00000-34d0e07000 r-xp 00000000 fd:00 1091
/lib64/librt-2.10.1.so
34d0e07000-34d1006000 ---p 00007000 fd:00 1091
/lib64/librt-2.10.1.so
34d1006000-34d1007000 r--p 00006000 fd:00 1091
/lib64/librt-2.10.1.so
34d1007000-34d1008000 rw-p 00007000 fd:00 1091
/lib64/librt-2.10.1.so
34d1200000-34d121c000 r-xp 00000000 fd:00 1097
/lib64/libselinux.so.1
34d121c000-34d141b000 ---p 0001c000 fd:00 1097
/lib64/libselinux.so.1
34d141b000-34d141c000 r--p 0001b000 fd:00 1097
/lib64/libselinux.so.1
34d141c000-34d141d000 rw-p 0001c000 fd:00 1097
/lib64/libselinux.so.1
34d141d000-34d141e000 rw-p 00000000 00:00 0
34d1600000-34d16dd000 r-xp 00000000 fd:00 1092
/lib64/libglib-2.0.so.0.2000.4
34d16dd000-34d18dc000 ---p 000dd000 fd:00 1092
/lib64/libglib-2.0.so.0.2000.4
34d18dc000-34d18de000 rw-p 000dc000 fd:00 1092
/lib64/libglib-2.0.so.0.2000.4
34d1a00000-34d1a41000 r-xp 00000000 fd:00 1094
/lib64/libgobject-2.0.so.0.2000.4
34d1a41000-34d1c41000 ---p 00041000 fd:00 1094
/lib64/libgobject-2.0.so.0.2000.4
34d1c41000-34d1c43000 rw-p 00041000 fd:00 1094
/lib64/libgobject-2.0.so.0.2000.4
34d1e00000-34d1e02000 r-xp 00000000 fd:00 1115
/usr/lib64/libXau.so.6.0.0
34d1e02000-34d2001000 ---p 00002000 fd:00 1115
/usr/lib64/libXau.so.6.0.0
34d2001000-34d2002000 rw-p 00001000 fd:00 1115
/usr/lib64/libXau.so.6.0.0
34d2200000-34d2203000 r-xp 00000000 fd:00 1096
/lib64/libgmodule-2.0.so.0.2000.4
34d2203000-34d2402000 ---p 00003000 fd:00 1096
/lib64/libgmodule-2.0.so.0.2000.4
34d2402000-34d2403000 rw-p 00002000 fd:00 1096
/lib64/libgmodule-2.0.so.0.2000.4
34d2600000-34d261a000 r-xp 00000000 fd:00 1116
/usr/lib64/libxcb.so.1.1.0
34d261a000-34d281a000 ---p 0001a000 fd:00 1116
/usr/lib64/libxcb.so.1.1.0
34d281a000-34d281b000 rw-p 0001a000 fd:00 1116
/usr/lib64/libxcb.so.1.1.0
34d2a00000-34d2b34000 r-xp 00000000 fd:00 1117
/usr/lib64/libX11.so.6.2.0
34d2b34000-34d2d33000 ---p 00134000 fd:00 1117
/usr/lib64/libX11.so.6.2.0
34d2d33000-34d2d39000 rw-p 00133000 fd:00 1117
/usr/lib64/libX11.so.6.2.0
34d2e00000-34d2e04000 r-xp 00000000 fd:00 1093
/lib64/libgthread-2.0.so.0.2000.4
34d2e04000-34d3003000 ---p 00004000 fd:00 1093
/lib64/libgthread-2.0.so.0.2000.4
34d3003000-34d3004000 rw-p 00003000 fd:00 1093
/lib64/libgthread-2.0.so.0.2000.4
34d3200000-34d3226000 r-xp 00000000 fd:00 1111
/lib64/libexpat.so.1.5.2
34d3226000-34d3425000 ---p 00026000 fd:00 1111
/lib64/libexpat.so.1.5.2
34d3425000-34d3428000 rw-p 00025000 fd:00 1111
/lib64/libexpat.so.1.5.2
34d3600000-34d3676000 r-xp 00000000 fd:00 1098
/lib64/libgio-2.0.so.0.2000.4
34d3676000-34d3875000 ---p 00076000 fd:00 1098
/lib64/libgio-2.0.so.0.2000.4
34d3875000-34d3877000 rw-p 00075000 fd:00 1098
/lib64/libgio-2.0.so.0.2000.4
34d3877000-34d3878000 rw-p 00000000 00:00 0
34d3a00000-34d3a93000 r-xp 00000000 fd:00 1110
/usr/lib64/libfreetype.so.6.3.20
34d3a93000-34d3c93000 ---p 00093000 fd:00 1110
/usr/lib64/libfreetype.so.6.3.20
34d3c93000-34d3c99000 rw-p 00093000 fd:00 1110
/usr/lib64/libfreetype.so.6.3.20
34d3e00000-34d3e04000 r-xp 00000000 fd:00 1141
/lib64/libattr.so.1.1.0
34d3e04000-34d4003000 ---p 00004000 fd:00 1141
/lib64/libattr.so.1.1.0
34d4003000-34d4004000 rw-p 00003000 fd:00 1141
/lib64/libattr.so.1.1.0
34d4200000-34d4211000 r-xp 00000000 fd:00 1123
/usr/lib64/libXext.so.6.4.0
34d4211000-34d4411000 ---p 00011000 fd:00 1123
/usr/lib64/libXext.so.6.4.0
34d4411000-34d4412000 rw-p 00011000 fd:00 1123
/usr/lib64/libXext.so.6.4.0
34d4600000-34d4604000 r-xp 00000000 fd:00 1142
/lib64/libcap.so.2.16
34d4604000-34d4803000 ---p 00004000 fd:00 1142
/lib64/libcap.so.2.16
34d4803000-34d4804000 rw-p 00003000 fd:00 1142
/lib64/libcap.so.2.16
34d4a00000-34d4a33000 r-xp 00000000 fd:00 1112
/usr/lib64/libfontconfig.so.1.4.1
34d4a33000-34d4c32000 ---p 00033000 fd:00 1112
/usr/lib64/libfontconfig.so.1.4.1
34d4c32000-34d4c34000 rw-p 00032000 fd:00 1112
/usr/lib64/libfontconfig.so.1.4.1
34d4e00000-34d4e25000 r-xp 00000000 fd:00 1114
/usr/lib64/libpng12.so.0.37.0
34d4e25000-34d5024000 ---p 00025000 fd:00 1114
/usr/lib64/libpng12.so.0.37.0
34d5024000-34d5025000 rw-p 00024000 fd:00 1114
/usr/lib64/libpng12.so.0.37.0
34d5200000-34d523c000 r-xp 00000000 fd:00 1143
/lib64/libdbus-1.so.3.4.0
34d523c000-34d543c000 ---p 0003c000 fd:00 1143
/lib64/libdbus-1.so.3.4.0
34d543c000-34d543d000 r--p 0003c000 fd:00 1143
/lib64/libdbus-1.so.3.4.0
34d543d000-34d543e000 rw-p 0003d000 fd:00 1143
/lib64/libdbus-1.so.3.4.0
34d5600000-34d5609000 r-xp 00000000 fd:00 1118
/usr/lib64/libXrender.so.1.3.0
34d5609000-34d5808000 ---p 00009000 fd:00 1118
/usr/lib64/libXrender.so.1.3.0
34d5808000-34d5809000 rw-p 00008000 fd:00 1118
/usr/lib64/libXrender.so.1.3.0
34d5a00000-34d5a2c000 r-xp 00000000 fd:00 1121
/usr/lib64/libpangoft2-1.0.so.0.2400.5
34d5a2c000-34d5c2b000 ---p 0002c000 fd:00 1121
/usr/lib64/libpangoft2-1.0.so.0.2400.5
34d5c2b000-34d5c2d000 rw-p 0002b000 fd:00 1121
/usr/lib64/libpangoft2-1.0.so.0.2400.5
34d5e00000-34d5e46000 r-xp 00000000 fd:00 1120
/usr/lib64/libpango-1.0.so.0.2400.5
34d5e46000-34d6046000 ---p 00046000 fd:00 1120
/usr/lib64/libpango-1.0.so.0.2400.5
34d6046000-34d6049000 rw-p 00046000 fd:00 1120
/usr/lib64/libpango-1.0.so.0.2400.5
34d6200000-34d6209000 r-xp 00000000 fd:00 1128
/usr/lib64/libXcursor.so.1.0.2
34d6209000-34d6409000 ---p 00009000 fd:00 1128
/usr/lib64/libXcursor.so.1.0.2
34d6409000-34d640a000 rw-p 00009000 fd:00 1128
/usr/lib64/libXcursor.so.1.0.2
34d6600000-34d6674000 r-xp 00000000 fd:00 1119
/usr/lib64/libcairo.so.2.10800.8
34d6674000-34d6873000 ---p 00074000 fd:00 1119
/usr/lib64/libcairo.so.2.10800.8
34d6873000-34d6876000 rw-p 00073000 fd:00 1119
/usr/lib64/libcairo.so.2.10800.8
34d6a00000-34d6a02000 r-xp 00000000 fd:00 1129
/usr/lib64/libXcomposite.so.1.0.0
34d6a02000-34d6c01000 ---p 00002000 fd:00 1129
/usr/lib64/libXcomposite.so.1.0.0
34d6c01000-34d6c02000 rw-p 00001000 fd:00 1129
/usr/lib64/libXcomposite.so.1.0.0
34d6e00000-34d6e99000 r-xp 00000000 fd:00 1132
/usr/lib64/libgdk-x11-2.0.so.0.1600.5
34d6e99000-34d7099000 ---p 00099000 fd:00 1132
/usr/lib64/libgdk-x11-2.0.so.0.1600.5
34d7099000-34d709e000 rw-p 00099000 fd:00 1132
/usr/lib64/libgdk-x11-2.0.so.0.1600.5
34d7200000-34d7243000 r-xp 00000000 fd:00 1109
/usr/lib64/libpixman-1.so.0.14.0
34d7243000-34d7442000 ---p 00043000 fd:00 1109
/usr/lib64/libpixman-1.so.0.14.0
34d7442000-34d7445000 rw-p 00042000 fd:00 1109
/usr/lib64/libpixman-1.so.0.14.0
34d7600000-34d761d000 r-xp 00000000 fd:00 1131
/usr/lib64/libgdk_pixbuf-2.0.so.0.1600.5
34d761d000-34d781c000 ---p 0001d000 fd:00 1131
/usr/lib64/libgdk_pixbuf-2.0.so.0.1600.5
34d781c000-34d781d000 rw-p 0001c000 fd:00 1131
/usr/lib64/libgdk_pixbuf-2.0.so.0.1600.5
34d7a00000-34d7a08000 r-xp 00000000 fd:00 1126
/usr/lib64/libXrandr.so.2.2.0
34d7a08000-34d7c07000 ---p 00008000 fd:00 1126
/usr/lib64/libXrandr.so.2.2.0
34d7c07000-34d7c08000 rw-p 00007000 fd:00 1126
/usr/lib64/libXrandr.so.2.2.0
34d7e00000-34d7e02000 r-xp 00000000 fd:00 1130
/usr/lib64/libXdamage.so.1.1.0
34d7e02000-34d8001000 ---p 00002000 fd:00 1130
/usr/lib64/libXdamage.so.1.1.0
34d8001000-34d8002000 rw-p 00001000 fd:00 1130
/usr/lib64/libXdamage.so.1.1.0
34d8200000-34d8209000 r-xp 00000000 fd:00 1125
/usr/lib64/libXi.so.6.0.0
34d8209000-34d8409000 ---p 00009000 fd:00 1125
/usr/lib64/libXi.so.6.0.0
34d8409000-34d840a000 rw-p 00009000 fd:00 1125
/usr/lib64/libXi.so.6.0.0
34d8600000-34d8602000 r-xp 00000000 fd:00 1124
/usr/lib64/libXinerama.so.1.0.0
34d8602000-34d8801000 ---p 00002000 fd:00 1124
/usr/lib64/libXinerama.so.1.0.0
34d8801000-34d8802000 rw-p 00001000 fd:00 1124
/usr/lib64/libXinerama.so.1.0.0
34d8a00000-34d8a05000 r-xp 00000000 fd:00 1127
/usr/lib64/libXfixes.so.3.1.0
34d8a05000-34d8c04000 ---p 00005000 fd:00 1127
/usr/lib64/libXfixes.so.3.1.0
34d8c04000-34d8c05000 rw-p 00004000 fd:00 1127
/usr/lib64/libXfixes.so.3.1.0
34d8e00000-34d91d6000 r-xp 00000000 fd:00 1134
/usr/lib64/libgtk-x11-2.0.so.0.1600.5
34d91d6000-34d93d5000 ---p 003d6000 fd:00 1134
/usr/lib64/libgtk-x11-2.0.so.0.1600.5
34d93d5000-34d93e0000 rw-p 003d5000 fd:00 1134
/usr/lib64/libgtk-x11-2.0.so.0.1600.5
34d93e0000-34d93e2000 rw-p 00000000 00:00 0
34d9400000-34d941d000 r-xp 00000000 fd:00 1133
/usr/lib64/libatk-1.0.so.0.2511.1
34d941d000-34d961c000 ---p 0001d000 fd:00 1133
/usr/lib64/libatk-1.0.so.0.2511.1
34d961c000-34d961f000 rw-p 0001c000 fd:00 1133
/usr/lib64/libatk-1.0.so.0.2511.1
34d9800000-34d980b000 r-xp 00000000 fd:00 1122
/usr/lib64/libpangocairo-1.0.so.0.2400.5
34d980b000-34d9a0a000 ---p 0000b000 fd:00 1122
/usr/lib64/libpangocairo-1.0.so.0.2400.5
34d9a0a000-34d9a0b000 rw-p 0000a000 fd:00 1122
/usr/lib64/libpangocairo-1.0.so.0.2400.5
34d9c00000-34d9c20000 r-xp 00000000 fd:00 1144
/usr/lib64/libdbus-glib-1.so.2.1.0
34d9c20000-34d9e1f000 ---p 00020000 fd:00 1144
/usr/lib64/libdbus-glib-1.so.2.1.0
34d9e1f000-34d9e21000 rw-p 0001f000 fd:00 1144
/usr/lib64/libdbus-glib-1.so.2.1.0
34da000000-34da003000 r-xp 00000000 fd:00 16360
/lib64/libuuid.so.1.2
34da003000-34da203000 ---p 00003000 fd:00 16360
/lib64/libuuid.so.1.2
34da203000-34da204000 rw-p 00003000 fd:00 16360
/lib64/libuuid.so.1.2
34da800000-34da85d000 r-xp 00000000 fd:00 1145
/usr/lib64/libORBit-2.so.0.1.0
34da85d000-34daa5c000 ---p 0005d000 fd:00 1145
/usr/lib64/libORBit-2.so.0.1.0
34daa5c000-34daa6f000 rw-p 0005c000 fd:00 1145
/usr/lib64/libORBit-2.so.0.1.0
34db000000-34db039000 r-xp 00000000 fd:00 1146
/usr/lib64/libgconf-2.so.4.1.5
34db039000-34db239000 ---p 00039000 fd:00 1146
/usr/lib64/libgconf-2.so.4.1.5
34db239000-34db23e000 rw-p 00039000 fd:00 1146
/usr/lib64/libgconf-2.so.4.1.5
34db400000-34db407000 r-xp 00000000 fd:00 16361
/usr/lib64/libSM.so.6.0.0
34db407000-34db607000 ---p 00007000 fd:00 16361
/usr/lib64/libSM.so.6.0.0
34db607000-34db608000 rw-p 00007000 fd:00 16361
/usr/lib64/libSM.so.6.0.0
34db800000-34db817000 r-xp 00000000 fd:00 16359
/usr/lib64/libICE.so.6.3.0
34db817000-34dba17000 ---p 00017000 fd:00 16359
/usr/lib64/libICE.so.6.3.0
34dba17000-34dba18000 rw-p 00017000 fd:00 16359
/usr/lib64/libICE.so.6.3.0
34dba18000-34dba1c000 rw-p 00000000 00:00 0
34dd000000-34dd019000 r-xp 00000000 fd:00 1139
/lib64/libgcc_s-4.4.1-20090729.so.1
34dd019000-34dd219000 ---p 00019000 fd:00 1139
/lib64/libgcc_s-4.4.1-20090729.so.1
34dd219000-34dd21a000 rw-p 00019000 fd:00 1139
/lib64/libgcc_s-4.4.1-20090729.so.1
34e0000000-34e0005000 r-xp 00000000 fd:00 26294
/usr/lib64/libXtst.so.6.1.0
34e0005000-34e0205000 ---p 00005000 fd:00 26294
/usr/lib64/libXtst.so.6.1.0
34e0205000-34e0206000 rw-p 00005000 fd:00 26294
/usr/lib64/libXtst.so.6.1.0
34e5000000-34e5018000 r-xp 00000000 fd:00 29867
/usr/lib64/libpolkit.so.2.0.0
34e5018000-34e5218000 ---p 00018000 fd:00 29867
/usr/lib64/libpolkit.so.2.0.0
34e5218000-34e5219000 rw-p 00018000 fd:00 29867
/usr/lib64/libpolkit.so.2.0.0
34e5800000-34e5805000 r-xp 00000000 fd:00 29887
/usr/lib64/libogg.so.0.5.3
34e5805000-34e5a04000 ---p 00005000 fd:00 29887
/usr/lib64/libogg.so.0.5.3
34e5a04000-34e5a05000 rw-p 00004000 fd:00 29887
/usr/lib64/libogg.so.0.5.3
34e6400000-34e6408000 r-xp 00000000 fd:00 1177
/usr/lib64/libltdl.so.7.2.0
34e6408000-34e6608000 ---p 00008000 fd:00 1177
/usr/lib64/libltdl.so.7.2.0
34e6608000-34e6609000 rw-p 00008000 fd:00 1177
/usr/lib64/libltdl.so.7.2.0
34e7400000-34e740c000 r-xp 00000000 fd:00 29868
/usr/lib64/libpolkit-dbus.so.2.0.0
34e740c000-34e760b000 ---p 0000c000 fd:00 29868
/usr/lib64/libpolkit-dbus.so.2.0.0
34e760b000-34e760c000 rw-p 0000b000 fd:00 29868
/usr/lib64/libpolkit-dbus.so.2.0.0
34e7800000-34e781f000 r-xp 00000000 fd:00 29888
/usr/lib64/libvorbis.so.0.4.0
34e781f000-34e7a1e000 ---p 0001f000 fd:00 29888
/usr/lib64/libvorbis.so.0.4.0
34e7a1e000-34e7a2d000 rw-p 0001e000 fd:00 29888
/usr/lib64/libvorbis.so.0.4.0
34e7c00000-34e7c0a000 r-xp 00000000 fd:00 29869
/usr/lib64/libpolkit-grant.so.2.0.0
34e7c0a000-34e7e09000 ---p 0000a000 fd:00 29869
/usr/lib64/libpolkit-grant.so.2.0.0
34e7e09000-34e7e0a000 rw-p 00009000 fd:00 29869
/usr/lib64/libpolkit-grant.so.2.0.0
34e8000000-34e8003000 r-xp 00000000 fd:00 29892
/usr/lib64/libcanberra-gtk.so.0.0.5
34e8003000-34e8203000 ---p 00003000 fd:00 29892
/usr/lib64/libcanberra-gtk.so.0.0.5
34e8203000-34e8204000 rw-p 00003000 fd:00 29892
/usr/lib64/libcanberra-gtk.so.0.0.5
34e8800000-34e880f000 r-xp 00000000 fd:00 29891
/usr/lib64/libcanberra.so.0.1.5
34e880f000-34e8a0e000 ---p 0000f000 fd:00 29891
/usr/lib64/libcanberra.so.0.1.5
34e8a0e000-34e8a0f000 rw-p 0000e000 fd:00 29891
/usr/lib64/libcanberra.so.0.1.5
34e9000000-34e9007000 r-xp 00000000 fd:00 29889
/usr/lib64/libvorbisfile.so.3.2.0
34e9007000-34e9206000 ---p 00007000 fd:00 29889
/usr/lib64/libvorbisfile.so.3.2.0
34e9206000-34e9207000 rw-p 00006000 fd:00 29889
/usr/lib64/libvorbisfile.so.3.2.0
34e9400000-34e940d000 r-xp 00000000 fd:00 29890
/usr/lib64/libtdb.so.1.1.5
34e940d000-34e960c000 ---p 0000d000 fd:00 29890
/usr/lib64/libtdb.so.1.1.5
34e960c000-34e960d000 rw-p 0000c000 fd:00 29890
/usr/lib64/libtdb.so.1.1.5
34e9c00000-34e9c0a000 r-xp 00000000 fd:00 29870
/usr/lib64/libpolkit-gnome.so.0.0.0
34e9c0a000-34e9e0a000 ---p 0000a000 fd:00 29870
/usr/lib64/libpolkit-gnome.so.0.0.0
34e9e0a000-34e9e0b000 rw-p 0000a000 fd:00 29870
/usr/lib64/libpolkit-gnome.so.0.0.0
3d14400000-3d14541000 r-xp 00000000 fd:00 114
/usr/lib64/libxml2.so.2.7.6
3d14541000-3d14740000 ---p 00141000 fd:00 114
/usr/lib64/libxml2.so.2.7.6
3d14740000-3d1474a000 rw-p 00140000 fd:00 114
/usr/lib64/libxml2.so.2.7.6
3d1474a000-3d1474b000 rw-p 00000000 00:00 0
3d14c00000-3d14c18000 r-xp 00000000 fd:00 48785
/usr/lib64/libglade-2.0.so.0.0.7
3d14c18000-3d14e17000 ---p 00018000 fd:00 48785
/usr/lib64/libglade-2.0.so.0.0.7
3d14e17000-3d14e19000 rw-p 00017000 fd:00 48785
/usr/lib64/libglade-2.0.so.0.0.7
3d16800000-3d168ed000 r-xp 00000000 fd:00 22864
/usr/lib64/libstdc++.so.6.0.12
3d168ed000-3d16aec000 ---p 000ed000 fd:00 22864
/usr/lib64/libstdc++.so.6.0.12
3d16aec000-3d16af3000 r--p 000ec000 fd:00 22864
/usr/lib64/libstdc++.so.6.0.12
3d16af3000-3d16af5000 rw-p 000f3000 fd:00 22864
/usr/lib64/libstdc++.so.6.0.12
3d16af5000-3d16b0a000 rw-p 00000000 00:00 0
7f05a3fae000-7f05a3fc1000 r-xp 00000000 fd:00 22909
/usr/lib64/libelf-0.142.so
7f05a3fc1000-7f05a41c0000 ---p 00013000 fd:00 22909
/usr/lib64/libelf-0.142.so
7f05a41c0000-7f05a41c1000 r--p 00012000 fd:00 22909
/usr/lib64/libelf-0.142.so
7f05a41c1000-7f05a41c2000 rw-p 00013000 fd:00 22909
/usr/lib64/libelf-0.142.so
7f05a41d4000-7f05a41d7000 r-xp 00000000 fd:00 116786
/usr/lib64/gtk-2.0/modules/libgnomebreakpad.so
7f05a41d7000-7f05a43d6000 ---p 00003000 fd:00 116786
/usr/lib64/gtk-2.0/modules/libgnomebreakpad.so
7f05a43d6000-7f05a43d7000 rw-p 00002000 fd:00 116786
/usr/lib64/gtk-2.0/modules/libgnomebreakpad.so
7f05a43d7000-7f05a43db000 r-xp 00000000 fd:00 40602
/usr/lib64/gtk-2.0/modules/libcanberra-gtk-module.so
7f05a43db000-7f05a45db000 ---p 00004000 fd:00 40602
/usr/lib64/gtk-2.0/modules/libcanberra-gtk-module.so
7f05a45db000-7f05a45dc000 rw-p 00004000 fd:00 40602
/usr/lib64/gtk-2.0/modules/libcanberra-gtk-module.so
7f05a45dc000-7f05a45df000 r-xp 00000000 fd:00 82244
/usr/lib64/gtk-2.0/modules/libpk-gtk-module.so
7f05a45df000-7f05a47de000 ---p 00003000 fd:00 82244
/usr/lib64/gtk-2.0/modules/libpk-gtk-module.so
7f05a47de000-7f05a47df000 rw-p 00002000 fd:00 82244
/usr/lib64/gtk-2.0/modules/libpk-gtk-module.so
7f05a47df000-7f05a47fb000 r--p 00000000 fd:00 14540
/usr/share/locale/ja/LC_MESSAGES/libc.mo
7f05a47fb000-7f05a480d000 r-xp 00000000 fd:00 53032
/usr/lib64/gtk-2.0/2.10.0/engines/libnodoka.so
7f05a480d000-7f05a4a0d000 ---p 00012000 fd:00 53032
/usr/lib64/gtk-2.0/2.10.0/engines/libnodoka.so
7f05a4a0d000-7f05a4a0e000 rw-p 00012000 fd:00 53032
/usr/lib64/gtk-2.0/2.10.0/engines/libnodoka.so
7f05a4a0e000-7f05a4a0f000 ---p 00000000 00:00 0
7f05a4a0f000-7f05a520f000 rw-p 00000000 00:00 0
7f05a520f000-7f05a521b000 r--p 00000000 fd:00 21639
/usr/share/locale/ja/LC_MESSAGES/glib20.mo
7f05a521b000-7f05a5227000 r-xp 00000000 fd:00 12418
/lib64/libnss_files-2.10.1.so
7f05a5227000-7f05a5426000 ---p 0000c000 fd:00 12418
/lib64/libnss_files-2.10.1.so
7f05a5426000-7f05a5427000 r--p 0000b000 fd:00 12418
/lib64/libnss_files-2.10.1.so
7f05a5427000-7f05a5428000 rw-p 0000c000 fd:00 12418
/lib64/libnss_files-2.10.1.so
7f05a5428000-7f05a543a000 r--p 00000000 fd:00 25291
/usr/share/locale/ja/LC_MESSAGES/GConf2.mo
7f05a543a000-7f05a544e000 r--p 00000000 fd:00 40242
/usr/share/locale/ja/LC_MESSAGES/gtk20.mo
7f05a544e000-7f05aa520000 r--p 00000000 fd:00 14558
/usr/lib/locale/locale-archive
7f05aa520000-7f05aa538000 rw-p 00000000 00:00 0
7f05aa53f000-7f05aa546000 r--s 00000000 fd:00 12712
/usr/lib64/gconv/gconv-modules.cache
7f05aa546000-7f05aa54a000 r--p 00000000 fd:00 110980
/usr/share/locale/ja/LC_MESSAGES/gnome-session-2.0.mo
7f05aa54a000-7f05aa54c000 rw-p 00000000 00:00 0
7fff45b42000-7fff45b57000 rw-p 00000000 00:00 0 [stack]
7fff45be4000-7fff45be5000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
[vsyscall]

2009-10-27 06:35:02

by Minchan Kim

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009 15:10:52 +0900
KOSAKI Motohiro <[email protected]> wrote:

> 2009/10/27 KAMEZAWA Hiroyuki <[email protected]>:
> > On Mon, 26 Oct 2009 17:16:14 +0100
> > Vedran Furač <[email protected]> wrote:
> >> >  - Could you show me /var/log/dmesg and /var/log/messages at OOM ?
> >>
> >> It was catastrophe. :) X crashed (or killed) with all the programs, but
> >> my little program was alive for 20 minutes (see timestamps). And for
> >> that time computer was completely unusable. Couldn't even get the
> >> console via ssh. Rally embarrassing for a modern OS to get destroyed by
> >> a 5 lines of C run as an ordinary user. Luckily screen was still alive,
> >> oomk usually kills it also. See for yourself:
> >>
> >> dmesg: http://pastebin.com/f3f83738a
> >> messages: http://pastebin.com/f2091110a
> >>
> >> (CCing to lklm again... I just want people to see the logs.)
> >>
> > Thank you for reporting and your patience. It seems something strange
> > that your KDE programs are killed. I agree.
> >
> > I attached a scirpt for checking oom_score of all exisiting process.
> > (oom_score is a value used for selecting "bad" processs.")
> > please run if you have time.
> >
> > This is a result of my own desktop(on virtual machine.)
> > In this environ (Total memory is 1.6GBytes), mmap(1G) program is running.
> >
> > %check_badness.pl | sort -n | tail
> > --
> > 89924   3938    mixer_applet2
> > 90210   3942    tomboy
> > 94753   3936    clock-applet
> > 101994  3919    pulseaudio
> > 113525  4028    gnome-terminal
> > 127340  1       init
> > 128177  3871    nautilus
> > 151003  11515   bash
> > 256944  11653   mmap
> > 425561  3829    gnome-session
> > --
> > Sigh, gnome-session has twice value of mmap(1G).
> > Of course, gnome-session only uses 6M bytes of anon.
> > I wonder this is because gnome-session has many children..but need to
> > dig more. Does anyone has idea ?
> > (CCed kosaki)
>
> Following output address the issue.
> The fact is, modern desktop application linked pretty many library. it
> makes bloat VSS size and increase
> OOM score.
>
> Ideally, We shouldn't account evictable file-backed mappings for oom_score.
>
Hmm.
I wonder why we consider VM size for OOM killing.
How about RSS size?


--
Kind regards,
Minchan Kim

2009-10-27 06:38:56

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009 15:34:29 +0900
Minchan Kim <[email protected]> wrote:

> On Tue, 27 Oct 2009 15:10:52 +0900
> KOSAKI Motohiro <[email protected]> wrote:
>
> > 2009/10/27 KAMEZAWA Hiroyuki <[email protected]>:
> > > On Mon, 26 Oct 2009 17:16:14 +0100
> > > Vedran Furač <[email protected]> wrote:
> > >> >  - Could you show me /var/log/dmesg and /var/log/messages at OOM ?
> > >>
> > >> It was catastrophe. :) X crashed (or killed) with all the programs, but
> > >> my little program was alive for 20 minutes (see timestamps). And for
> > >> that time computer was completely unusable. Couldn't even get the
> > >> console via ssh. Rally embarrassing for a modern OS to get destroyed by
> > >> a 5 lines of C run as an ordinary user. Luckily screen was still alive,
> > >> oomk usually kills it also. See for yourself:
> > >>
> > >> dmesg: http://pastebin.com/f3f83738a
> > >> messages: http://pastebin.com/f2091110a
> > >>
> > >> (CCing to lklm again... I just want people to see the logs.)
> > >>
> > > Thank you for reporting and your patience. It seems something strange
> > > that your KDE programs are killed. I agree.
> > >
> > > I attached a scirpt for checking oom_score of all exisiting process.
> > > (oom_score is a value used for selecting "bad" processs.")
> > > please run if you have time.
> > >
> > > This is a result of my own desktop(on virtual machine.)
> > > In this environ (Total memory is 1.6GBytes), mmap(1G) program is running.
> > >
> > > %check_badness.pl | sort -n | tail
> > > --
> > > 89924   3938    mixer_applet2
> > > 90210   3942    tomboy
> > > 94753   3936    clock-applet
> > > 101994  3919    pulseaudio
> > > 113525  4028    gnome-terminal
> > > 127340  1       init
> > > 128177  3871    nautilus
> > > 151003  11515   bash
> > > 256944  11653   mmap
> > > 425561  3829    gnome-session
> > > --
> > > Sigh, gnome-session has twice value of mmap(1G).
> > > Of course, gnome-session only uses 6M bytes of anon.
> > > I wonder this is because gnome-session has many children..but need to
> > > dig more. Does anyone has idea ?
> > > (CCed kosaki)
> >
> > Following output address the issue.
> > The fact is, modern desktop application linked pretty many library. it
> > makes bloat VSS size and increase
> > OOM score.
> >
> > Ideally, We shouldn't account evictable file-backed mappings for oom_score.
> >
> Hmm.
> I wonder why we consider VM size for OOM kiling.
> How about RSS size?
>

Maybe the current code assumes "tons of swap have already been generated" by
the time oom-kill is invoked. Then, just using mm->anon_rss will not be correct.

Hm, should we count the number of swap entries referenced from each mm?....

Regards,
-Kame

2009-10-27 06:46:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Memory overcommit

> > > %check_badness.pl | sort -n | tail
> > > --
> > > 89924   3938    mixer_applet2
> > > 90210   3942    tomboy
> > > 94753   3936    clock-applet
> > > 101994  3919    pulseaudio
> > > 113525  4028    gnome-terminal
> > > 127340  1       init
> > > 128177  3871    nautilus
> > > 151003  11515   bash
> > > 256944  11653   mmap
> > > 425561  3829    gnome-session
> > > --
> > > Sigh, gnome-session has twice value of mmap(1G).
> > > Of course, gnome-session only uses 6M bytes of anon.
> > > I wonder this is because gnome-session has many children..but need to
> > > dig more. Does anyone has idea ?
> > > (CCed kosaki)
> >
> > Following output address the issue.
> > The fact is, modern desktop application linked pretty many library. it
> > makes bloat VSS size and increase
> > OOM score.
> >
> > Ideally, We shouldn't account evictable file-backed mappings for oom_score.
> >
> Hmm.
> I wonder why we consider VM size for OOM kiling.
> How about RSS size?

Because a swapped-out bad guy (e.g. a fork-bomb process) should still
be killed by the OOM killer.
RSS + swap entries is acceptable to me.


2009-10-27 06:55:23

by Minchan Kim

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, Oct 27, 2009 at 3:36 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 27 Oct 2009 15:34:29 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Tue, 27 Oct 2009 15:10:52 +0900
>> KOSAKI Motohiro <[email protected]> wrote:
>>
>> > 2009/10/27 KAMEZAWA Hiroyuki <[email protected]>:
>> > > On Mon, 26 Oct 2009 17:16:14 +0100
>> > > Vedran Furač <[email protected]> wrote:
>> > >> >  - Could you show me /var/log/dmesg and /var/log/messages at OOM ?
>> > >>
>> > >> It was catastrophe. :) X crashed (or killed) with all the programs, but
>> > >> my little program was alive for 20 minutes (see timestamps). And for
>> > >> that time computer was completely unusable. Couldn't even get the
>> > >> console via ssh. Rally embarrassing for a modern OS to get destroyed by
>> > >> a 5 lines of C run as an ordinary user. Luckily screen was still alive,
>> > >> oomk usually kills it also. See for yourself:
>> > >>
>> > >> dmesg: http://pastebin.com/f3f83738a
>> > >> messages: http://pastebin.com/f2091110a
>> > >>
>> > >> (CCing to lklm again... I just want people to see the logs.)
>> > >>
>> > > Thank you for reporting and your patience. It seems something strange
>> > > that your KDE programs are killed. I agree.
>> > >
>> > > I attached a scirpt for checking oom_score of all exisiting process.
>> > > (oom_score is a value used for selecting "bad" processs.")
>> > > please run if you have time.
>> > >
>> > > This is a result of my own desktop(on virtual machine.)
>> > > In this environ (Total memory is 1.6GBytes), mmap(1G) program is running.
>> > >
>> > > %check_badness.pl | sort -n | tail
>> > > --
>> > > 89924   3938    mixer_applet2
>> > > 90210   3942    tomboy
>> > > 94753   3936    clock-applet
>> > > 101994  3919    pulseaudio
>> > > 113525  4028    gnome-terminal
>> > > 127340  1       init
>> > > 128177  3871    nautilus
>> > > 151003  11515   bash
>> > > 256944  11653   mmap
>> > > 425561  3829    gnome-session
>> > > --
>> > > Sigh, gnome-session has twice value of mmap(1G).
>> > > Of course, gnome-session only uses 6M bytes of anon.
>> > > I wonder this is because gnome-session has many children..but need to
>> > > dig more. Does anyone has idea ?
>> > > (CCed kosaki)
>> >
>> > Following output address the issue.
>> > The fact is, modern desktop application linked pretty many library. it
>> > makes bloat VSS size and increase
>> > OOM score.
>> >
>> > Ideally, We shouldn't account evictable file-backed mappings for oom_score.
>> >
>> Hmm.
>> I wonder why we consider VM size for OOM kiling.
>> How about RSS size?
>>
>
> Maybe the current code assumes "Tons of swap have been generated, already" if
> oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
>
> Hm, should we count # of swap entries reference from mm ?....

In Vedran's case, no swap was in use, so considering only VM size is the problem.
I think it would be better to consider RSS + the number of swap entries, as
Kosaki mentioned.


>
> Regards,
> -Kame
>
>
>



--
Kind regards,
Minchan Kim

2009-10-27 06:57:32

by Minchan Kim

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009 15:46:36 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > > > %check_badness.pl | sort -n | tail
> > > > --
> > > > 89924   3938    mixer_applet2
> > > > 90210   3942    tomboy
> > > > 94753   3936    clock-applet
> > > > 101994  3919    pulseaudio
> > > > 113525  4028    gnome-terminal
> > > > 127340  1       init
> > > > 128177  3871    nautilus
> > > > 151003  11515   bash
> > > > 256944  11653   mmap
> > > > 425561  3829    gnome-session
> > > > --
> > > > Sigh, gnome-session has twice value of mmap(1G).
> > > > Of course, gnome-session only uses 6M bytes of anon.
> > > > I wonder this is because gnome-session has many children..but need to
> > > > dig more. Does anyone has idea ?
> > > > (CCed kosaki)
> > >
> > > Following output address the issue.
> > > The fact is, modern desktop application linked pretty many library. it
> > > makes bloat VSS size and increase
> > > OOM score.
> > >
> > > Ideally, We shouldn't account evictable file-backed mappings for oom_score.
> > >
> > Hmm.
> > I wonder why we consider VM size for OOM kiling.
> > How about RSS size?
>
> Because, swap out-ed bad body (e.g. fork bomb process) still should
> be killed by oom.
> RSS + swap-entries is acceptable to me.

It's reasonable to me.
As I mentioned in my reply to Kame, in Vedran's case no swap was in use.
I think considering only VM size is the problem.

--
Kind regards,
Minchan Kim

2009-10-27 07:47:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 15:55:26 +0900
Minchan Kim <[email protected]> wrote:

> >> Hmm.
> >> I wonder why we consider VM size for OOM kiling.
> >> How about RSS size?
> >>
> >
> > Maybe the current code assumes "Tons of swap have been generated, already" if
> > oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
> >
> > Hm, should we count # of swap entries reference from mm ?....
>
> In Vedran case, he didn't use swap. So, Only considering vm is the problem.
> I think it would be better to consider both RSS + # of swap entries as
> Kosaki mentioned.
>
Then, maybe this kind of patch is necessary.
This is against 2.6.31... I may have to rebase it to mmotm.
Added more CCs.

Vedran, I'm glad if you can test this patch.


==
Now, the oom-killer's score uses mm->total_vm as its base value.
But these days, applications such as GUI programs tend to link
many shared libraries, so total_vm grows very large even when
the pages are not fully mapped.

For example, with a program "mmap" that allocates 1 GB of
anonymous memory running, the oom_score top 10 on the system will be:

score PID name
89924 3938 mixer_applet2
90210 3942 tomboy
94753 3936 clock-applet
101994 3919 pulseaudio
113525 4028 gnome-terminal
127340 1 init
128177 3871 nautilus
151003 11515 bash
256944 11653 mmap <-----------------use 1G of anon
425561 3829 gnome-session

No one believes gnome-session is more guilty than "mmap".

Instead of total_vm, we should use the anon/file/swap usage of a process, I think.
This patch adds mm->swap_usage and calculates oom_score based on
anon_rss + file_rss + swap_usage.
For typical applications, this is much better information than
total_vm. After this patch, the score on my desktop is

score PID name
4033 3176 gnome-panel
4077 3113 xinit
4526 3190 python
4820 3161 gnome-settings-
4989 3289 gnome-terminal
7105 3271 tomboy
8427 3177 nautilus
17549 3140 gnome-session
128501 3299 bash
256106 3383 mmap

This order is not bad, I think.

Note: this adds a new counter, so it adds some cost.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
 include/linux/mm_types.h |    1 +
 mm/memory.c              |   29 +++++++++++++++++++++--------
 mm/oom_kill.c            |   12 +++++++++---
 mm/rmap.c                |    1 +
 mm/swapfile.c            |    1 +
 5 files changed, 33 insertions(+), 11 deletions(-)

Index: linux-2.6.31/include/linux/mm_types.h
===================================================================
--- linux-2.6.31.orig/include/linux/mm_types.h
+++ linux-2.6.31/include/linux/mm_types.h
@@ -228,6 +228,7 @@ struct mm_struct {
*/
mm_counter_t _file_rss;
mm_counter_t _anon_rss;
+ mm_counter_t _swap_usage;

unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
Index: linux-2.6.31/mm/memory.c
===================================================================
--- linux-2.6.31.orig/mm/memory.c
+++ linux-2.6.31/mm/memory.c
@@ -361,12 +361,15 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
return 0;
}

-static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
+static inline
+void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss, int swaps)
{
if (file_rss)
add_mm_counter(mm, file_rss, file_rss);
if (anon_rss)
add_mm_counter(mm, anon_rss, anon_rss);
+ if (swaps)
+ add_mm_counter(mm, swap_usage, swaps);
}

/*
@@ -562,6 +565,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
&src_mm->mmlist);
spin_unlock(&mmlist_lock);
}
+ if (!is_migration_entry(entry))
+ rss[2]++;
if (is_write_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
@@ -611,10 +616,10 @@ static int copy_pte_range(struct mm_stru
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
- int rss[2];
+ int rss[3];

again:
- rss[1] = rss[0] = 0;
+ rss[2] = rss[1] = rss[0] = 0;
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte)
return -ENOMEM;
@@ -645,7 +650,7 @@ again:
arch_leave_lazy_mmu_mode();
spin_unlock(src_ptl);
pte_unmap_nested(src_pte - 1);
- add_mm_rss(dst_mm, rss[0], rss[1]);
+ add_mm_rss(dst_mm, rss[0], rss[1], rss[2]);
pte_unmap_unlock(dst_pte - 1, dst_ptl);
cond_resched();
if (addr != end)
@@ -769,6 +774,7 @@ static unsigned long zap_pte_range(struc
spinlock_t *ptl;
int file_rss = 0;
int anon_rss = 0;
+ int swaps = 0;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -838,13 +844,19 @@ static unsigned long zap_pte_range(struc
if (pte_file(ptent)) {
if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
print_bad_pte(vma, addr, ptent, NULL);
- } else if
- (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
- print_bad_pte(vma, addr, ptent, NULL);
+ } else {
+ swp_entry_t entry = pte_to_swp_entry(ptent);
+
+ if (!is_migration_entry(entry))
+ swaps++;
+
+ if (unlikely(!free_swap_and_cache(entry)))
+ print_bad_pte(vma, addr, ptent, NULL);
+ }
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

- add_mm_rss(mm, file_rss, anon_rss);
+ add_mm_rss(mm, file_rss, anon_rss, swaps);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);

@@ -2573,6 +2585,7 @@ static int do_swap_page(struct mm_struct
*/

inc_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, swap_usage);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
Index: linux-2.6.31/mm/rmap.c
===================================================================
--- linux-2.6.31.orig/mm/rmap.c
+++ linux-2.6.31/mm/rmap.c
@@ -834,6 +834,7 @@ static int try_to_unmap_one(struct page
spin_unlock(&mmlist_lock);
}
dec_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, swap_usage);
} else if (PAGE_MIGRATION) {
/*
* Store the pfn of the page in a special migration
Index: linux-2.6.31/mm/swapfile.c
===================================================================
--- linux-2.6.31.orig/mm/swapfile.c
+++ linux-2.6.31/mm/swapfile.c
@@ -830,6 +830,7 @@ static int unuse_pte(struct vm_area_stru
}

inc_mm_counter(vma->vm_mm, anon_rss);
+ dec_mm_counter(vma->vm_mm, swap_usage);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
Index: linux-2.6.31/mm/oom_kill.c
===================================================================
--- linux-2.6.31.orig/mm/oom_kill.c
+++ linux-2.6.31/mm/oom_kill.c
@@ -69,7 +69,8 @@ unsigned long badness(struct task_struct
/*
* The memory size of the process is the basis for the badness.
*/
- points = mm->total_vm;
+ points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss)
+ + get_mm_counter(mm, swap_usage);

/*
* After this unlock we can no longer dereference local variable `mm'
@@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
*/
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
+ if (child->mm != mm && child->mm) {
+ unsigned long cpoint;
+ /* At considering child, we don't count swap */
+ cpoint = get_mm_counter(child->mm, anon_rss) +
+ get_mm_counter(child->mm, file_rss);
+ points += cpoint/2 + 1;
+ }
task_unlock(child);
}

2009-10-27 07:56:46

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 16:45:26 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 27 Oct 2009 15:55:26 +0900
> Minchan Kim <[email protected]> wrote:
>
> > >> Hmm.
> > >> I wonder why we consider VM size for OOM kiling.
> > >> How about RSS size?
> > >>
> > >
> > > Maybe the current code assumes "Tons of swap have been generated, already" if
> > > oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
> > >
> > > Hm, should we count # of swap entries reference from mm ?....
> >
> > In Vedran case, he didn't use swap. So, Only considering vm is the problem.
> > I think it would be better to consider both RSS + # of swap entries as
> > Kosaki mentioned.
> >
> Then, maybe this kind of patch is necessary.
> This is on 2.6.31...then I may have to rebase this to mmotom.
> Added more CCs.
>
> Vedran, I'm glad if you can test this patch.
>
>
> ==
> Now, oom-killer's score uses mm->total_vm as its base value.
> But, in these days, applications like GUI program tend to use
> much shared libraries and total_vm grows too high even when
> pages are not fully mapped.
>
> For example, running a program "mmap" which allocates 1 GBbytes of
> anonymous memory, oom_score top 10 on system will be..
>
> score PID name
> 89924 3938 mixer_applet2
> 90210 3942 tomboy
> 94753 3936 clock-applet
> 101994 3919 pulseaudio
> 113525 4028 gnome-terminal
> 127340 1 init
> 128177 3871 nautilus
> 151003 11515 bash
> 256944 11653 mmap <-----------------use 1G of anon
> 425561 3829 gnome-session
>
> No one believes gnome-session is more guilty than "mmap".
>
> Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> This patch adds mm->swap_usage and calculate oom_score based on
> anon_rss + file_rss + swap_usage.
> Considering usual applications, this will be much better information than
> total_vm. After this patch, the score on my desktop is
>
> score PID name
> 4033 3176 gnome-panel
> 4077 3113 xinit
> 4526 3190 python
> 4820 3161 gnome-settings-
> 4989 3289 gnome-terminal
> 7105 3271 tomboy
> 8427 3177 nautilus
> 17549 3140 gnome-session
> 128501 3299 bash
> 256106 3383 mmap
>
> This order is not bad, I think.
>
> Note: This adss new counter...then new cost is added.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Thanks for making the patch.
Let's hear others' opinions. :)

--
Kind regards,
Minchan Kim

2009-10-27 07:58:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 16:45:26 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
/*
> * After this unlock we can no longer dereference local variable `mm'
> @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> */
> list_for_each_entry(child, &p->children, sibling) {
> task_lock(child);
> - if (child->mm != mm && child->mm)
> - points += child->mm->total_vm/2 + 1;
> + if (child->mm != mm && child->mm) {
> + unsigned long cpoint;
> + /* At considering child, we don't count swap */
> + cpoint = get_mm_counter(child->mm, anon_rss) +
> + get_mm_counter(child->mm, file_rss);
> + points += cpoint/2 + 1;
> + }
> task_unlock(child);

BTW, I'd like to get rid of this code.

Can't we use other techniques for detecting a fork bomb?

This check can't catch the following pattern, anyway:

fork()
-> fork()
-> fork()
-> fork()
....

but I have no good idea.
What is the difference between a task launcher and a fork bomb...

Thanks,
-Kame

2009-10-27 08:15:17

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 16:56:28 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 27 Oct 2009 16:45:26 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
> /*
> > * After this unlock we can no longer dereference local variable `mm'
> > @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> > */
> > list_for_each_entry(child, &p->children, sibling) {
> > task_lock(child);
> > - if (child->mm != mm && child->mm)
> > - points += child->mm->total_vm/2 + 1;
> > + if (child->mm != mm && child->mm) {
> > + unsigned long cpoint;
> > + /* At considering child, we don't count swap */
> > + cpoint = get_mm_counter(child->mm, anon_rss) +
> > + get_mm_counter(child->mm, file_rss);
> > + points += cpoint/2 + 1;
> > + }
> > task_unlock(child);
>
> BTW, I'd like to get rid of this code.
>
> Can't we use other techniques for detecting fork-bomb ?
>
> This check can't catch following type, anyway.
>
> fork()
> -> fork()
> -> fork()
> -> fork()
> ....
>
> but I have no good idea.
> What is the difference with task-launcher and fork bomb()...
>

I think it's good as-is.
It's hard for the kernel to detect this efficiently; it depends on the
application. So shouldn't a task launcher like gnome-session adjust its
own oom_score?

Any ideas are welcome if the kernel can do it well.

> Thanks,
> -Kame
>


--
Kind regards,
Minchan Kim

2009-10-27 08:35:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 17:14:41 +0900
Minchan Kim <[email protected]> wrote:

> On Tue, 27 Oct 2009 16:56:28 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Tue, 27 Oct 2009 16:45:26 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > /*
> > > * After this unlock we can no longer dereference local variable `mm'
> > > @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> > > */
> > > list_for_each_entry(child, &p->children, sibling) {
> > > task_lock(child);
> > > - if (child->mm != mm && child->mm)
> > > - points += child->mm->total_vm/2 + 1;
> > > + if (child->mm != mm && child->mm) {
> > > + unsigned long cpoint;
> > > + /* At considering child, we don't count swap */
> > > + cpoint = get_mm_counter(child->mm, anon_rss) +
> > > + get_mm_counter(child->mm, file_rss);
> > > + points += cpoint/2 + 1;
> > > + }
> > > task_unlock(child);
> >
> > BTW, I'd like to get rid of this code.
> >
> > Can't we use other techniques for detecting fork-bomb ?
> >
> > This check can't catch following type, anyway.
> >
> > fork()
> > -> fork()
> > -> fork()
> > -> fork()
> > ....
> >
> > but I have no good idea.
> > What is the difference with task-launcher and fork bomb()...
> >
>
> I think it's good as-is.
> Kernel is hard to know it by effiecient method.
> It depends on applications. so Doesnt's task-launcher
> like gnome-session have to control his oom_score?
>
> Welcome to any ideas if kernel can do it well.
>
Hmmm, check the system-wide fork rate (forks/sec) and fork depth? Maybe not difficult to calculate..

Regards,
-Kame

2009-10-27 08:53:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 17:33:08 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 27 Oct 2009 17:14:41 +0900
> Minchan Kim <[email protected]> wrote:
>
> > On Tue, 27 Oct 2009 16:56:28 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > On Tue, 27 Oct 2009 16:45:26 +0900
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > /*
> > > > * After this unlock we can no longer dereference local variable `mm'
> > > > @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> > > > */
> > > > list_for_each_entry(child, &p->children, sibling) {
> > > > task_lock(child);
> > > > - if (child->mm != mm && child->mm)
> > > > - points += child->mm->total_vm/2 + 1;
> > > > + if (child->mm != mm && child->mm) {
> > > > + unsigned long cpoint;
> > > > + /* At considering child, we don't count swap */
> > > > + cpoint = get_mm_counter(child->mm, anon_rss) +
> > > > + get_mm_counter(child->mm, file_rss);
> > > > + points += cpoint/2 + 1;
> > > > + }
> > > > task_unlock(child);
> > >
> > > BTW, I'd like to get rid of this code.
> > >
> > > Can't we use other techniques for detecting fork-bomb ?
> > >
> > > This check can't catch following type, anyway.
> > >
> > > fork()
> > > -> fork()
> > > -> fork()
> > > -> fork()
> > > ....
> > >
> > > but I have no good idea.
> > > What is the difference with task-launcher and fork bomb()...
> > >
> >
> > I think it's good as-is.
> > Kernel is hard to know it by effiecient method.
> > It depends on applications. so Doesnt's task-launcher
> > like gnome-session have to control his oom_score?
> >
> > Welcome to any ideas if kernel can do it well.
> >
> Hmmm, check system-wide fork/sec and fork-depth ? Maybe not difficult to calculate..

Yes. We can do anything in the kernel to achieve the goal.
Maybe check the timing, or count fork depth.
My concern is how we can do it cleanly, if it really is a serious
problem for the kernel. ;)

I think most programs that have many children are victims of OOM killing,
and that makes sense to me. There are some cases where it doesn't make
sense, like a task launcher. So if a task launcher, which is a very rare
and special kind of program, can change oom_adj by itself, that's better
than adding a new heuristic to the kernel.

It's just my opinion. :)

> Regards,
> -Kame
>


--
Kind regards,
Minchan Kim

2009-10-27 08:58:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 17:52:43 +0900
Minchan Kim <[email protected]> wrote:

> On Tue, 27 Oct 2009 17:33:08 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Tue, 27 Oct 2009 17:14:41 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> > > On Tue, 27 Oct 2009 16:56:28 +0900
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > >
> > > > On Tue, 27 Oct 2009 16:45:26 +0900
> > > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > /*
> > > > > * After this unlock we can no longer dereference local variable `mm'
> > > > > @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> > > > > */
> > > > > list_for_each_entry(child, &p->children, sibling) {
> > > > > task_lock(child);
> > > > > - if (child->mm != mm && child->mm)
> > > > > - points += child->mm->total_vm/2 + 1;
> > > > > + if (child->mm != mm && child->mm) {
> > > > > + unsigned long cpoint;
> > > > > + /* At considering child, we don't count swap */
> > > > > + cpoint = get_mm_counter(child->mm, anon_rss) +
> > > > > + get_mm_counter(child->mm, file_rss);
> > > > > + points += cpoint/2 + 1;
> > > > > + }
> > > > > task_unlock(child);
> > > >
> > > > BTW, I'd like to get rid of this code.
> > > >
> > > > Can't we use other techniques for detecting fork-bomb ?
> > > >
> > > > This check can't catch following type, anyway.
> > > >
> > > > fork()
> > > > -> fork()
> > > > -> fork()
> > > > -> fork()
> > > > ....
> > > >
> > > > but I have no good idea.
> > > > What is the difference with task-launcher and fork bomb()...
> > > >
> > >
> > > I think it's good as-is.
> > > Kernel is hard to know it by effiecient method.
> > > It depends on applications. so Doesnt's task-launcher
> > > like gnome-session have to control his oom_score?
> > >
> > > Welcome to any ideas if kernel can do it well.
> > >
> > Hmmm, check system-wide fork/sec and fork-depth ? Maybe not difficult to calculate..
>
> Yes. We can do anything to achieve the goal in kernel.
> Maybe check the time or fork-depth counting.
> What I have a concern is how we can do it nicely if it is a serious
> problem in kernel. ;)
>
Yes... in the end, only the user knows whether the user is wrong, especially
in the case of a memory leak.

> I think most of program which have many child are victims of OOM killing.
> It make sense to me. There is some cases to not make sense like task-launcher.
> So I think if task-launcher which is very rare and special program can change
> oom_adj by itself, it's good than thing that add new heuristic in kernel.
>
> It's just my opinon. :)
>
I know KDE already adjusts oom_adj in their 3.5 release ;)
Okay, let's concentrate on the total_vm issue for a while.

Thanks,
-Kame

2009-10-27 12:38:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, Oct 27, 2009 at 04:56:12PM +0900, Minchan Kim wrote:
> Thanks for making the patch.
> Let's hear other's opinion. :)

total_vm is nearly meaningless, especially on 64-bit, which reduces the
mmap pressure from libraries. I tried to change it to something "physical"
(rss; I didn't add swap) some time ago too, and I'm not sure why I didn't
manage to get it in. Trying again surely sounds good. Accounting swap isn't
necessarily good: we may be killing a task that isn't accessing memory
at all. So yes, we free swap, but if the task is the "bloater" it's
unlikely to be all in swap, since it did all the recent activity that led
to the oom. So I'm unsure whether swap is good to account here, but I
surely ack replacing virtual with rss. I would include the whole rss, as
the file-backed part may also be rendered unswappable if it is accessed in
a loop, refreshing the young bit all the time.

2009-10-27 17:12:45

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

KAMEZAWA Hiroyuki wrote:

> On Mon, 26 Oct 2009 17:16:14 +0100
> Vedran Furač <[email protected]> wrote:
>>> - Could you show me /var/log/dmesg and /var/log/messages at OOM ?
>> It was catastrophe. :) X crashed (or killed) with all the programs, but
>> my little program was alive for 20 minutes (see timestamps). And for
>> that time computer was completely unusable. Couldn't even get the
>> console via ssh. Really embarrassing for a modern OS to get destroyed by
>> 5 lines of C run as an ordinary user. Luckily screen was still alive,
>> oomk usually kills it also. See for yourself:
>>
>> dmesg: http://pastebin.com/f3f83738a
>> messages: http://pastebin.com/f2091110a
>>
>> (CCing to lklm again... I just want people to see the logs.)
>>
> Thank you for reporting and your patience. It seems strange
> that your KDE programs are killed. I agree.

No problem. I want this to be solved as much as you do. Actually, it is
not strange, just a buggy algorithm.

Run:

% ps -T -eo pid,ppid,tid,vsz,command

You'll see that the ppid of a number of processes is kdeinit, gnome-session,
fvwm or something else depending on what one is using. All of these
processes are started automatically during startup, or manually by clicking
on a menu item or by some keyboard shortcut. The OOM algorithm just sums the
memory usage of all of them and adds that to the parent. Just plain wrong.

Also, it seems it's looking at VIRT instead of RES.
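
The child-summing heuristic described above can be modelled in a few lines. This is a simplified, hypothetical sketch of how badness() behaves (the real code in mm/oom_kill.c applies further adjustments); the struct and field names here are illustrative, not kernel ABI:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of a task for scoring purposes. */
struct task {
    unsigned long total_vm;     /* virtual size, in pages */
    struct task *children[8];
    size_t nchildren;
};

/* Sketch of the legacy heuristic: a task's score starts at its own
 * total_vm, then absorbs half of each child's total_vm. */
unsigned long badness(const struct task *p)
{
    unsigned long points = p->total_vm;
    for (size_t i = 0; i < p->nchildren; i++)
        points += p->children[i]->total_vm / 2;
    return points;
}
```

With four identical memory-hog children, a tiny launcher shell ends up with roughly twice the score of any single child, which is exactly the test.sh effect shown later in the thread.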

> I attached a script for checking the oom_score of all existing processes.
> (oom_score is the value used for selecting a "bad" process.)
> Please run it if you have time.

96890 21463 VirtualBox // OK
118615 11144 kded4 // WRONG
127455 11158 knotify4 // WRONG
132198 1 init // WRONG
133940 11151 ksmserver // WRONG
134109 11224 audacious2 // Audio player, maybe
145476 21503 VirtualBox // OK
174939 11322 icedove-bin // thunderbird, maybe
178015 11223 akregator // rss reader, maybe
201043 22672 krusader // WRONG
212609 11187 krunner // WRONG
256911 24252 test // culprit, malloced 1GB
1750371 11318 run-mozilla.sh // tiny, parent of firefox threads
2044902 11141 kdeinit4 // tiny, parent of most KDE apps

> Sigh, gnome-session has twice the value of mmap(1G).
> Of course, gnome-session only uses 6M bytes of anon.
> I wonder if this is because gnome-session has many children... but need to

Yes it is.

Regards,

Vedran

2009-10-27 17:41:24

by Vedran Furač

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

KAMEZAWA Hiroyuki wrote:

> On Tue, 27 Oct 2009 15:55:26 +0900
> Minchan Kim <[email protected]> wrote:
>
>>>> Hmm.
>>>> I wonder why we consider VM size for OOM kiling.
>>>> How about RSS size?
>>>>
>>> Maybe the current code assumes "Tons of swap have been generated, already" if
>>> oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
>>>
>>> Hm, should we count # of swap entries reference from mm ?....
>> In Vedran case, he didn't use swap. So, Only considering vm is the problem.
>> I think it would be better to consider both RSS + # of swap entries as
>> Kosaki mentioned.
>>
> Then, maybe this kind of patch is necessary.
> This is on 2.6.31... then I may have to rebase this to mmotm.
> Added more CCs.
>
> Vedran, I'm glad if you can test this patch.

Thanks for the patch! I'll test it during this week and report after that.

> Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> This patch adds mm->swap_usage and calculate oom_score based on
> anon_rss + file_rss + swap_usage.

Isn't file_rss shared between processes? Sorry, I'm a newbie. :)

% pmap $(pidof test)
29049: ./test
0000000000400000 4K r-x-- /home/vedranf/dev/tmp/test
0000000000600000 4K rw--- /home/vedranf/dev/tmp/test
00002ba362a80000 116K r-x-- /lib/ld-2.10.1.so
00002ba362a9d000 12K rw--- [ anon ]
00002ba362c9c000 4K r---- /lib/ld-2.10.1.so
00002ba362c9d000 4K rw--- /lib/ld-2.10.1.so
00002ba362c9e000 1320K r-x-- /lib/libc-2.10.1.so
00002ba362de8000 2044K ----- /lib/libc-2.10.1.so
00002ba362fe7000 16K r---- /lib/libc-2.10.1.so
00002ba362feb000 4K rw--- /lib/libc-2.10.1.so
00002ba362fec000 1024028K rw--- [ anon ] // <-- This
00007ffff4618000 84K rw--- [ stack ]
00007ffff47b7000 4K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 1027648K

I would just look at anon if that's OK (or possible).

> Considering usual applications, this will be much better information than
> total_vm.

Agreed.

> score PID name
> 4033 3176 gnome-panel
> 4077 3113 xinit
> 4526 3190 python
> 4820 3161 gnome-settings-
> 4989 3289 gnome-terminal
> 7105 3271 tomboy
> 8427 3177 nautilus
> 17549 3140 gnome-session
> 128501 3299 bash
> 256106 3383 mmap
>
> This order is not bad, I think.

Yes, this looks much better now. Only bash has a somewhat strangely
high score.

Regards,

Vedran

2009-10-27 18:02:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Memory overcommit

>> I attached a script for checking the oom_score of all existing processes.
>> (oom_score is the value used for selecting a "bad" process.)
>> Please run it if you have time.
>
> 96890 21463 VirtualBox // OK
> 118615 11144 kded4 // WRONG
> 127455 11158 knotify4 // WRONG
> 132198 1 init // WRONG
> 133940 11151 ksmserver // WRONG
> 134109 11224 audacious2 // Audio player, maybe
> 145476 21503 VirtualBox // OK
> 174939 11322 icedove-bin // thunderbird, maybe
> 178015 11223 akregator // rss reader, maybe
> 201043 22672 krusader // WRONG
> 212609 11187 krunner // WRONG
> 256911 24252 test // culprit, malloced 1GB
> 1750371 11318 run-mozilla.sh // tiny, parent of firefox threads
> 2044902 11141 kdeinit4 // tiny, parent of most KDE apps

Vedran, I made an alternative improvement idea. Can you please measure
the badness score on your system?
Maybe your culprit process takes the biggest badness value.

Note: this patch changes time-related behaviour. So, please drink a cup of
coffee before measuring;
a small rest time makes for a correct test result.


Attachments:
0001-oom-oom-score-bonus-by-run_time-use-proportional-va.patch (2.95 kB)

2009-10-27 18:30:26

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

KOSAKI Motohiro wrote:

>>> I attached a script for checking the oom_score of all existing processes.
>>> (oom_score is the value used for selecting a "bad" process.)
>>> Please run it if you have time.
>> 96890 21463 VirtualBox // OK
>> 118615 11144 kded4 // WRONG
>> 127455 11158 knotify4 // WRONG
>> 132198 1 init // WRONG
>> 133940 11151 ksmserver // WRONG
>> 134109 11224 audacious2 // Audio player, maybe
>> 145476 21503 VirtualBox // OK
>> 174939 11322 icedove-bin // thunderbird, maybe
>> 178015 11223 akregator // rss reader, maybe
>> 201043 22672 krusader // WRONG
>> 212609 11187 krunner // WRONG
>> 256911 24252 test // culprit, malloced 1GB
>> 1750371 11318 run-mozilla.sh // tiny, parent of firefox threads
>> 2044902 11141 kdeinit4 // tiny, parent of most KDE apps
>
> Vedran, I made an alternative improvement idea. Can you please measure
> the badness score
> on your system?
> Maybe your culprit process takes the biggest badness value.

Thanks, I'll test it during the week. But note that not every user
reboots their computer every day. I, for example, usually have mine up for
days. And when it comes to my laptop: weeks, as I just suspend it when
I don't use it. Maybe the best way is to combine the two patches. Also, you
and others could test these patches as well. It is not only my kernel that
behaves strangely. :)

> Note: this patch changes time-related behaviour. So, please drink a cup of
> coffee before measuring;
> a small rest time makes for a correct test result.

OK. :)

Regards,

Vedran

2009-10-27 18:39:02

by Hugh Dickins

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> Now, oom-killer's score uses mm->total_vm as its base value.
> But, in these days, applications like GUI program tend to use
> much shared libraries and total_vm grows too high even when
> pages are not fully mapped.
>
> For example, running a program "mmap" which allocates 1 GB of
> anonymous memory, oom_score top 10 on system will be..
>
> score PID name
> 89924 3938 mixer_applet2
> 90210 3942 tomboy
> 94753 3936 clock-applet
> 101994 3919 pulseaudio
> 113525 4028 gnome-terminal
> 127340 1 init
> 128177 3871 nautilus
> 151003 11515 bash
> 256944 11653 mmap <-----------------use 1G of anon
> 425561 3829 gnome-session
>
> No one believes gnome-session is more guilty than "mmap".
>
> Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> This patch adds mm->swap_usage and calculate oom_score based on
> anon_rss + file_rss + swap_usage.
> Considering usual applications, this will be much better information than
> total_vm. After this patch, the score on my desktop is
>
> score PID name
> 4033 3176 gnome-panel
> 4077 3113 xinit
> 4526 3190 python
> 4820 3161 gnome-settings-
> 4989 3289 gnome-terminal
> 7105 3271 tomboy
> 8427 3177 nautilus
> 17549 3140 gnome-session
> 128501 3299 bash
> 256106 3383 mmap
>
> This order is not bad, I think.
>
> Note: This adds a new counter... so new cost is added.

I've often thought we ought to supply such a swap_usage statistic;
and show it in /proc/pid/statsomething, presumably VmSwap in
/proc/pid/status, even an additional field on the end of statm.

A slight new cost, yes: doesn't matter at the swapping end, but
would slightly impact fork and exit - I do hope we can afford it,
because I think it should have been available all along.

I've not checked your patch in detail; but I do agree that basing
OOM (physical memory) decisions on total_vm (virtual memory) has
seemed weird, so it's well worth trying this approach. Whether swap
should be included along with rss isn't quite clear to me: I'm not
saying you're wrong, not at all, just that it's not quite obvious.

I've several observations to make about bad OOM kill decisions,
but it's probably better that I make them in the original
"Memory overcommit" thread, rather than divert this thread.

Hugh

2009-10-27 18:48:29

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, Oct 27, 2009 at 06:39:07PM +0000, Hugh Dickins wrote:
> OOM (physical memory) decisions on total_vm (virtual memory) has
> seemed weird, so it's well worth trying this approach. Whether swap

It is weird and wrong, I strongly support fixing it once and for
all. The oom killing should be based on physical info, total_vm is
a very rough approximation of the real info we're interested about
(real RAM utilization of the task).

> should be included along with rss isn't quite clear to me: I'm not
> saying you're wrong, not at all, just that it's not quite obvious.

Agreed it's not obvious. Intuitively I think only including RSS and no
swap is best, but clearly I can't be entirely against including swap
too as there may be scenarios where including swap provides for a
better choice.

My argument for not including swap is that we kill tasks to free RAM
(we don't really care to free swap, system needs RAM at oom time).
Freeing swap won't immediately help because no RAM is freed when swap
is released (sure other tasks that sits huge in RAM can be moved to
swap after swap isn't full but if we immediately killed those tasks
that were huge in RAM in the first place we'd be better off).

> I've several observations to make about bad OOM kill decisions,
> but it's probably better that I make them in the original
> "Memory overcommit" thread, rather than divert this thread.

:)

2009-10-27 20:44:21

by Hugh Dickins

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> Sigh, gnome-session has twice the value of mmap(1G).
> Of course, gnome-session only uses 6M bytes of anon.
> I wonder if this is because gnome-session has many children... but need to
> dig more. Does anyone have an idea?

When preparing KSM unmerge to handle OOM, I looked at how the precedent
was handled by running a little program which mmaps an anonymous region
of the same size as physical memory, then tries to mlock it. The
program was such an obvious candidate to be killed, I was shocked
by the poor decisions the OOM killer made. Usually I ran it with
mem=512M, with gnome and firefox active. Often the OOM killer killed
it right the first time, but went wrong when I tried it a second time
(I think that's because of what's already swapped out the first time).

I built up a patchset of fixes, but once I came to split them up for
submission, not one of them seemed entirely satisfactory; and Andrea's
fix to the KSM/mlock deadlock forced me to abandon even the first of
the patches (we've since then fixed the way munlocking behaves, so
in theory could revisit that; but Andrea disliked what I was trying
to do there in KSM for other reasons, so I've not touched it since).
I had to get on with KSM, so I set it all aside: none of the issues
was a recent regression.

I did briefly wonder about the reliance on total_vm which you're now
looking into, but didn't touch that at all. Let me describe those
issues which I did try but fail to fix - I've no more time to deal
with them now than then, but ought at least to mention them to you.

1. select_bad_process() tries to avoid killing another process while
there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
processes. However, p->mm is set to NULL well before p reaches
exit_mmap() to actually free the memory, and there may be significant
delays in between (I think exit_robust_list() gave me a hang at one
stage). So in practice, even when the OOM killer selects the right
process to kill, there can be lots of collateral damage from it not
waiting long enough for that process to give up its memory.

I tried to deal with that by moving the TIF_MEMDIE test up before
the p->mm test, but adding in a check on p->exit_state:
if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
!p->exit_state)
return ERR_PTR(-1UL);
But this is then liable to hang the system if there's some reason
why the selected process cannot proceed to free its memory (e.g.
the current KSM unmerge case). It needs to wait "a while", but
give up if no progress is made, instead of hanging: originally
I thought that setting PF_MEMALLOC more widely in page_alloc.c,
and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
would deal with that; but we cannot be sure that waiting for memory
is the only reason for a holdup there (in the KSM unmerge case it's
waiting for an mmap_sem, and there may well be other such cases).

2. I started out running my mlock test program as root (later
switched to use "ulimit -l unlimited" first). But badness() reckons
CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
and CAP_SYS_RAWIO another reason to quarter your points: so running
as root makes you sixteen times less likely to be killed. Quartering
is anyway debatable, but sixteenthing seems utterly excessive to me.

I moved the CAP_SYS_RAWIO test in with the others, so it does no
more than quartering; but is quartering appropriate anyway? I did
wonder if I was right to be "subverting" the fine-grained CAPs in
this way, but have since seen unrelated mail from one who knows
better, implying they're something of a fantasy, that su and sudo
are indeed what's used in the real world. Maybe this patch was okay.
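
The double quartering Hugh objects to is easy to see with a back-of-envelope sketch (a hypothetical helper, not the real badness() code): with the two capability checks applied separately, a root task's score shrinks to one sixteenth of its raw value.

```c
#include <assert.h>

/* Sketch of the capability discounts in badness():
 * CAP_SYS_ADMIN/CAP_SYS_RESOURCE quarters the points, and the separate
 * CAP_SYS_RAWIO test quarters them again, so a typical root task is
 * divided by 16 overall. */
unsigned long discount(unsigned long points, int has_admin, int has_rawio)
{
    if (has_admin)
        points /= 4;
    if (has_rawio)
        points /= 4;
    return points;
}
```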

3. badness() has a comment above it which says:
* 5) we try to kill the process the user expects us to kill, this
* algorithm has been meticulously tuned to meet the principle
* of least surprise ... (be careful when you change it)
But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
adds plenty of surprise there, by trying to factor children into the
calculation. Intended to deal with forkbombs, but any reasonable
process whose purpose is to fork children (e.g. gnome-session)
becomes very vulnerable. And whereas badness() itself goes on to
refine the total_vm points by various adjustments peculiar to the
process in question, those refinements have been ignored when
adding the child's total_vm/2. (Andrea does remark that he'd
rather have rewritten badness() from scratch.)

I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
part of the calculation up to select_bad_process(), making a
solo_badness() function which makes all those adjustments to
total_vm, then badness() itself a simple function adding half
the children's solo_badness()es to the process' own solo_badness().
But probably lots more needs doing - Andrea's rewrite?

4. In some cases those children are sharing exactly the same mm,
yet its total_vm is being added again and again to the points:
I had a nasty inner loop searching back to see if we'd already
counted this mm (but then, what if the different tasks sharing
the mm deserved different adjustments to the total_vm?).


I hope these notes help someone towards a better solution
(and be prepared to discover more on the way). I agree with
Vedran that the present behaviour is pretty unimpressive, and
I'm puzzled as to how people can have been tinkering with
oom_kill.c down the years without seeing any of this.

Hugh

2009-10-27 21:04:34

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009, Hugh Dickins wrote:

> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer killed
> it right the first time, but went wrong when I tried it a second time
> (I think that's because of what's already swapped out the first time).
>

The heuristics that the oom killer uses in selecting a task seem to get
debated quite often.

What hasn't been mentioned is that total_vm does do a good job of
identifying tasks that are using far more memory than expected. That
seems to be the initial target: killing a rogue task that is hogging much
more memory than it should, probably because of a memory leak.

The latest approach seems to be focused more on killing the task that will
free the most resident memory. That certainly is understandable to avoid
killing additional tasks later and avoiding subsequent page allocations in
the short term, but doesn't help to kill the memory leaker.

There's advantages to either approach, but it depends on the contextual
goal of the oom killer when it's called: kill a rogue task that is
allocating more memory than expected, or kill a task that will free the
most memory.

> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> !p->exit_state)
> return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting of memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>

I've proposed an oom killer timeout in the past which adds a jiffies count
to struct task_struct and will defer killing other tasks until the
predefined time limit (we use 10*HZ) has been exceeded. The problem is
that even if you kill another task, it is highly unlikely that the expired
task will ever exit at that point and is still holding a substantial
amount of memory since it also had access to memory reserves and has still
failed to exit.

> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>

I think someone (Nick?) proposed a patch at one time that removed most of
the heuristics from select_bad_process() other than total_vm of the task
and its children, mems_allowed intersection, and oom_adj.

> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>

oom_kill_process() may not kill the task selected by select_bad_process();
it will first attempt to kill one of its children with a different mm.

2009-10-28 00:08:47

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> There's advantages to either approach, but it depends on the contextual
> goal of the oom killer when it's called: kill a rogue task that is
> allocating more memory than expected,

But it is wrong at attributing allocated memory!
Come on, it kills /usr/lib/icedove/run-mozilla.sh, the parent, a shell
script, instead of its children which allocated the memory. Look, "test"
allocates some (0.1GB) memory, and you have:

% cat test.sh

#!/bin/sh
./test&
./test&
./test&
./test

% perl check_badness.pl|sort -n|g test

26511 7884 test
26511 7885 test
26511 7886 test
26511 7887 test
53994 7883 test.sh

// great, so test.sh "is" the bad ass, ok, emulate OOMK:

% kill -9 7883

// did we kill "a rogue task"?

% perl check_badness.pl|sort -n|g test

26511 7884 test
26511 7885 test
26511 7886 test
26511 7887 test

// nooo, they are still alive and eating our memory!

QED by newbie. ;)

> or kill a task that will free the most memory.


2009-10-28 00:15:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 18:41:22 +0100
Vedran Furač <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
>
> > On Tue, 27 Oct 2009 15:55:26 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> >>>> Hmm.
> >>>> I wonder why we consider VM size for OOM kiling.
> >>>> How about RSS size?
> >>>>
> >>> Maybe the current code assumes "Tons of swap have been generated, already" if
> >>> oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
> >>>
> >>> Hm, should we count # of swap entries reference from mm ?....
> >> In Vedran case, he didn't use swap. So, Only considering vm is the problem.
> >> I think it would be better to consider both RSS + # of swap entries as
> >> Kosaki mentioned.
> >>
> > Then, maybe this kind of patch is necessary.
> > This is on 2.6.31...then I may have to rebase this to mmotom.
> > Added more CCs.
> >
> > Vedran, I'm glad if you can test this patch.
>
> Thanks for the patch! I'll test it during this week a report after that.
>
> > Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> > This patch adds mm->swap_usage and calculate oom_score based on
> > anon_rss + file_rss + swap_usage.
>
> Isn't file_rss shared between processes? Sorry, I'm newbie. :)
>
It's shared. But in the typical case, file_rss will be very small at OOM.


> % pmap $(pidof test)
> 29049: ./test
> 0000000000400000 4K r-x-- /home/vedranf/dev/tmp/test
> 0000000000600000 4K rw--- /home/vedranf/dev/tmp/test
> 00002ba362a80000 116K r-x-- /lib/ld-2.10.1.so
> 00002ba362a9d000 12K rw--- [ anon ]
> 00002ba362c9c000 4K r---- /lib/ld-2.10.1.so
> 00002ba362c9d000 4K rw--- /lib/ld-2.10.1.so
> 00002ba362c9e000 1320K r-x-- /lib/libc-2.10.1.so
> 00002ba362de8000 2044K ----- /lib/libc-2.10.1.so
> 00002ba362fe7000 16K r---- /lib/libc-2.10.1.so
> 00002ba362feb000 4K rw--- /lib/libc-2.10.1.so
> 00002ba362fec000 1024028K rw--- [ anon ] // <-- This
> 00007ffff4618000 84K rw--- [ stack ]
> 00007ffff47b7000 4K r-x-- [ anon ]
> ffffffffff600000 4K r-x-- [ anon ]
> total 1027648K
>
> I would just look at anon if that's OK (or possible).
>
> > Considering usual applications, this will be much better information than
> > total_vm.
>
> Agreed.
>
> > score PID name
> > 4033 3176 gnome-panel
> > 4077 3113 xinit
> > 4526 3190 python
> > 4820 3161 gnome-settings-
> > 4989 3289 gnome-terminal
> > 7105 3271 tomboy
> > 8427 3177 nautilus
> > 17549 3140 gnome-session
> > 128501 3299 bash
> > 256106 3383 mmap
> >
> > This order is not bad, I think.
>
> Yes, this looks much better now. Bash is only having somewhat strangely
> high score.
>
It gets half the score of mmap. If mmap goes, bash's score will go down
dramatically. I'll read others' comments and tweak this patch more.

Thanks,
-Kame



> Regards,
>
> Vedran
>

2009-10-28 00:25:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 13:38:10 +0100
Andrea Arcangeli <[email protected]> wrote:

> On Tue, Oct 27, 2009 at 04:56:12PM +0900, Minchan Kim wrote:
> > Thanks for making the patch.
> > Let's hear other's opinion. :)
>
> total_vm is nearly meaningless, especially on 64bit that reduces the
> mmap load on libs, I tried to change it to something "physical" (rss,
> didn't add swap too) some time ago too, not sure why I didn't manage
> to get it in. Trying again surely sounds good. Accounting swap isn't
> necessarily good, we may be killing a task that isn't accessing memory
> at all. So yes, we free swap but if the task is the "bloater" it's
> unlikely to be all in swap as it did all recent activity that lead to
> the oom. So I'm unsure if swap is good to account here, but surely I
> ack to replace virtual with rss. I would include the whole rss, as the
> file one may also be rendered unswappable if it is accessed in a loop
> refreshing the young bit all the time.
>
I wonder if I should account swap and export it via a /proc/<pid>/??? file.
So, I'll divide this patch into 2 parts: swap accounting and the oom patch.

Considering the amount of swap at oom isn't very bad, I think. But using the
same weight for rss and swap is not good, maybe.

Hmm, maybe
anon_rss + file_rss/2 + swap_usage/4 + KOSAKI's time accounting change
can give us some better value. I'll consider what number is logical and
technically correct, again.
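
The weighting above could be sketched like this (a hypothetical helper mirroring the proposed formula, with page counts as inputs; not an actual kernel patch): full weight for anonymous pages, half for file-backed rss, a quarter for swap.

```c
#include <assert.h>

/* Hypothetical sketch of the proposed weighted oom score.
 * Anonymous memory counts fully, file rss half, swap a quarter,
 * so a task that is mostly anon in RAM scores highest. */
unsigned long oom_score(unsigned long anon_rss, unsigned long file_rss,
                        unsigned long swap_usage)
{
    return anon_rss + file_rss / 2 + swap_usage / 4;
}
```

Under this weighting an anon bloater still outranks a task of the same total size that is mostly file-backed or swapped out, which matches Andrea's argument that killing resident anon memory is what actually frees RAM.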

I'll prepare a series of 2-4 patches.

Thanks,
-Kame

2009-10-28 00:25:53

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 28 Oct 2009, Vedran Furač wrote:

> But it is wrong at counting allocated memory!
> Come on, it kills /usr/lib/icedove/run-mozilla.sh. Parent, a shell
> script, instead of its child(s) which allocated memory. Look, "test"
> allocates some (0.1GB) memory, and you have:
>
> % cat test.sh
>
> #!/bin/sh
> ./test&
> ./test&
> ./test&
> ./test
>
> % perl check_badness.pl|sort -n|g test
>
> 26511 7884 test
> 26511 7885 test
> 26511 7886 test
> 26511 7887 test
> 53994 7883 test.sh
>
> // great, so test.sh "is" the bad ass, ok, emulate OOMK:
>
> % kill -9 7883
>
> // did we kill "a rogue task"
>
> % perl check_badness.pl|sort -n|g test
>
> 26511 7884 test
> 26511 7885 test
> 26511 7886 test
> 26511 7887 test
>
> // nooo, they are still alive and eating our memory!
>

This is wrong; it doesn't "emulate oom", since oom_kill_process() always
kills a child of the selected process instead if they do not share the
same memory. The chosen task in that case is untouched.
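
The child-sacrifice behaviour David describes can be modelled roughly as follows (a simplified, hypothetical sketch; the real oom_kill_process() walks the task's children and skips those sharing the victim's mm):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified process model: mm identifies the address space. */
struct proc {
    int pid;
    const void *mm;
    struct proc *children[8];
    size_t nchildren;
};

/* Return the process that would actually be killed: the first child
 * with a different mm, else the chosen task itself. */
struct proc *oom_kill_target(struct proc *chosen)
{
    for (size_t i = 0; i < chosen->nchildren; i++)
        if (chosen->children[i]->mm != chosen->mm)
            return chosen->children[i];
    return chosen;
}
```

So `kill -9` on the selected parent is not equivalent: the OOM killer would have sacrificed one of the children first, and only a threaded child sharing the parent's mm would push the kill back onto the parent itself.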

2009-10-28 00:31:22

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 18:39:07 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:

> On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > Now, oom-killer's score uses mm->total_vm as its base value.
> > But, in these days, applications like GUI program tend to use
> > much shared libraries and total_vm grows too high even when
> > pages are not fully mapped.
> >
> > For example, running a program "mmap" which allocates 1 GB of
> > anonymous memory, oom_score top 10 on system will be..
> >
> > score PID name
> > 89924 3938 mixer_applet2
> > 90210 3942 tomboy
> > 94753 3936 clock-applet
> > 101994 3919 pulseaudio
> > 113525 4028 gnome-terminal
> > 127340 1 init
> > 128177 3871 nautilus
> > 151003 11515 bash
> > 256944 11653 mmap <-----------------use 1G of anon
> > 425561 3829 gnome-session
> >
> > No one believes gnome-session is more guilty than "mmap".
> >
> > Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> > This patch adds mm->swap_usage and calculate oom_score based on
> > anon_rss + file_rss + swap_usage.
> > Considering usual applications, this will be much better information than
> > total_vm. After this patch, the score on my desktop is
> >
> > score PID name
> > 4033 3176 gnome-panel
> > 4077 3113 xinit
> > 4526 3190 python
> > 4820 3161 gnome-settings-
> > 4989 3289 gnome-terminal
> > 7105 3271 tomboy
> > 8427 3177 nautilus
> > 17549 3140 gnome-session
> > 128501 3299 bash
> > 256106 3383 mmap
> >
> > This order is not bad, I think.
> >
> > Note: This adss new counter...then new cost is added.
>
> I've often thought we ought to supply such a swap_usage statistic;
> and show it in /proc/pid/statsomething, presumably VmSwap in
> /proc/pid/status, even an additional field on the end of statm.
>
Hm, ok. I'll divide this patch into:

- replace total_vm with anon_rss + file_rss (everyone will agree with this)
- add swap usage accounting
- show it via /proc (may need discussion about its style)
- use the value in the oom calculation (needs discussion)

> A slight new cost, yes: doesn't matter at the swapping end, but
> would slightly impact fork and exit - I do hope we can afford it,
> because I think it should have been available all along.
>
fork()/exit() use batched counting, so we don't see the overhead.


> I've not checked your patch in detail; but I do agree that basing
> OOM (physical memory) decisions on total_vm (virtual memory) has
> seemed weird, so it's well worth trying this approach. Whether swap
> should be included along with rss isn't quite clear to me: I'm not
> saying you're wrong, not at all, just that it's not quite obvious.
>
yes. It just comes from heuristics. It will need discussion/investigation/theory.


> I've several observations to make about bad OOM kill decisions,
> but it's probably better that I make them in the original
> "Memory overcommit" thread, rather than divert this thread.
>

Thanks,
-Kame

2009-10-28 00:34:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 19:47:43 +0100
Andrea Arcangeli <[email protected]> wrote:
> > should be included along with rss isn't quite clear to me: I'm not
> > saying you're wrong, not at all, just that it's not quite obvious.
>
> Agreed it's not obvious. Intuitively I think only including RSS and no
> swap is best, but clearly I can't be entirely against including swap
> too as there may be scenarios where including swap provides for a
> better choice.
>
> My argument for not including swap is that we kill tasks to free RAM
> (we don't really care to free swap, system needs RAM at oom time).
> Freeing swap won't immediately help because no RAM is freed when swap
> is released (sure other tasks that sits huge in RAM can be moved to
> swap after swap isn't full but if we immediately killed those tasks
> that were huge in RAM in the first place we'd be better off).
>
Okay.

As a first step, I'll divide this into:
- a patch replacing total_vm with anon_rss/file_rss
- swap accounting
- a patch considering whether the swap amount should be included or not

Then the necessary parts can go in early, and backporting will be easy.

Thanks,
-Kame

2009-10-28 00:39:30

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> kills a child of the selected process instead if they do not share the
> same memory. The chosen task in that case is untouched.

OK, I stand corrected then. Thanks! But, while testing this I lost X
once again and "test" survived for some time (check the timestamps):

http://pastebin.com/d5c9d026e

- It started by killing gkrellm(!!!)
- Then I lost X (kdeinit4 I guess)
- Then 103 seconds after the killing started, it killed "test" - the
real culprit.

I mean... how?!

2009-10-28 00:46:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009 20:44:16 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:

> On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > Sigh, gnome-session has twice the score of mmap (1G).
> > Of course, gnome-session only uses 6MB of anon.
> > I wonder if this is because gnome-session has many children... but I need to
> > dig more. Does anyone have an idea?
>
> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer killed
> it right the first time, but went wrong when I tried it a second time
> (I think that's because of what's already swapped out the first time).
>
> I built up a patchset of fixes, but once I came to split them up for
> submission, not one of them seemed entirely satisfactory; and Andrea's
> fix to the KSM/mlock deadlock forced me to abandon even the first of
> the patches (we've since then fixed the way munlocking behaves, so
> in theory could revisit that; but Andrea disliked what I was trying
> to do there in KSM for other reasons, so I've not touched it since).
> I had to get on with KSM, so I set it all aside: none of the issues
> was a recent regression.
>
> I did briefly wonder about the reliance on total_vm which you're now
> looking into, but didn't touch that at all. Let me describe those
> issues which I did try but fail to fix - I've no more time to deal
> with them now than then, but ought at least to mention them to you.
>
Okay, thank you for detailed information.


> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
Hmm.

> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> !p->exit_state)
> return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting of memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>
OK, then simple handling can't help here.

> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
I can't agree with that part of the heuristics, either.

> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>
ok.



> 3. badness() has a comment above it which says:
> * 5) we try to kill the process the user expects us to kill, this
> * algorithm has been meticulously tuned to meet the principle
> * of least surprise ... (be careful when you change it)
> But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
> refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
> adds plenty of surprise there, by trying to factor children into the
> calculation. Intended to deal with forkbombs, but any reasonable
> process whose purpose is to fork children (e.g. gnome-session)
> becomes very vulnerable. And whereas badness() itself goes on to
> refine the total_vm points by various adjustments peculiar to the
> process in question, those refinements have been ignored when
> adding the child's total_vm/2. (Andrea does remark that he'd
> rather have rewritten badness() from scratch.)
>
> I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
> part of the calculation up to select_bad_process(), making a
> solo_badness() function which makes all those adjustments to
> total_vm, then badness() itself a simple function adding half
> the children's solo_badness()es to the process' own solo_badness().
> But probably lots more needs doing - Andrea's rewrite?
>
> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>
>
> I hope these notes help someone towards a better solution
> (and be prepared to discover more on the way). I agree with
> Vedran that the present behaviour is pretty unimpressive, and
> I'm puzzled as to how people can have been tinkering with
> oom_kill.c down the years without seeing any of this.
>

Sorry, I usually don't use X on servers, and almost all of my recent OOM testing
was done under memcg ;(
Thank you for your investigation. Maybe I'll need several steps.

Thanks,
-Kame

2009-10-28 00:45:24

by Vedran Furač

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

KAMEZAWA Hiroyuki wrote:

> Hmm, maybe
> anon_rss + file_rss/2 + swap_usage/4 + kosaki's time accounting change
> can give us some better value. I'll consider what number is logical and
> technically correct, again.

Although my vote doesn't count, from my experience, this formula sounds
like optimal solution. Thanks, hope it gets accepted!

Regards,

Vedran

2009-10-28 02:47:54

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Memory overcommit

> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.

I agree quartering is debatable.
At least, removing the compounded quartering is worthwhile for any user, and it could be pushed into -stable.




From 27331555366c908a93c2cdd780b77e421869c5af Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Wed, 28 Oct 2009 11:28:39 +0900
Subject: [PATCH] oom: Mitigate super-user's bonus of oom-score

Currently, the OOM badness calculation applies the following bonuses:
- super-user processes have their oom-score quartered
- CAP_SYS_RAWIO processes (e.g. databases) also have their oom-score quartered

The problem is that super-users have CAP_SYS_RAWIO too, so their score is
sixteenthed. That is obviously excessive and meaningless.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 13 +++++--------
1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ea2147d..40d323d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
/*
* Superuser processes are usually more important, so we make it
* less likely that we kill those.
- */
- if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
- has_capability_noaudit(p, CAP_SYS_RESOURCE))
- points /= 4;
-
- /*
- * We don't want to kill a process with direct hardware access.
+ *
+ * Plus, We don't want to kill a process with direct hardware access.
* Not only could that mess up the hardware, but usually users
* tend to only have this flag set on applications they think
* of as important.
*/
- if (has_capability_noaudit(p, CAP_SYS_RAWIO))
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+ has_capability_noaudit(p, CAP_SYS_RAWIO))
points /= 4;

/*
--
1.6.2.5



2009-10-28 03:19:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 28 Oct 2009 11:47:55 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > 2. I started out running my mlock test program as root (later
> > switched to use "ulimit -l unlimited" first). But badness() reckons
> > CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> > and CAP_SYS_RAWIO another reason to quarter your points: so running
> > as root makes you sixteen times less likely to be killed. Quartering
> > is anyway debatable, but sixteenthing seems utterly excessive to me.
> >
> > I moved the CAP_SYS_RAWIO test in with the others, so it does no
> > more than quartering; but is quartering appropriate anyway? I did
> > wonder if I was right to be "subverting" the fine-grained CAPs in
> > this way, but have since seen unrelated mail from one who knows
> > better, implying they're something of a fantasy, that su and sudo
> > are indeed what's used in the real world. Maybe this patch was okay.
>
> I agree quartering is debatable.
> At least, removing the compounded quartering is worthwhile for any user, and it could be pushed into -stable.
>
>
>
>
> From 27331555366c908a93c2cdd780b77e421869c5af Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Wed, 28 Oct 2009 11:28:39 +0900
> Subject: [PATCH] oom: Mitigate super-user's bonus of oom-score
>
> Currently, the OOM badness calculation applies the following bonuses:
> - super-user processes have their oom-score quartered
> - CAP_SYS_RAWIO processes (e.g. databases) also have their oom-score quartered
>
> The problem is that super-users have CAP_SYS_RAWIO too, so their score is
> sixteenthed. That is obviously excessive and meaningless.
>
> This patch fixes it.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

I'll pick this up to my series.

Thanks,
-Kame

> ---
> mm/oom_kill.c | 13 +++++--------
> 1 files changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ea2147d..40d323d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> /*
> * Superuser processes are usually more important, so we make it
> * less likely that we kill those.
> - */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> - has_capability_noaudit(p, CAP_SYS_RESOURCE))
> - points /= 4;
> -
> - /*
> - * We don't want to kill a process with direct hardware access.
> + *
> + * Plus, We don't want to kill a process with direct hardware access.
> * Not only could that mess up the hardware, but usually users
> * tend to only have this flag set on applications they think
> * of as important.
> */
> - if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> points /= 4;
>
> /*
> --
> 1.6.2.5
>
>
>
>
>

2009-10-28 04:09:03

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 28 Oct 2009, Vedran Furac wrote:

> > This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> > kills a child of the selected process instead if they do not share the
> > same memory. The chosen task in that case is untouched.
>
> OK, I stand corrected then. Thanks! But, while testing this I lost X
> once again and "test" survived for some time (check the timestamps):
>
> http://pastebin.com/d5c9d026e
>
> - It started by killing gkrellm(!!!)
> - Then I lost X (kdeinit4 I guess)
> - Then 103 seconds after the killing started, it killed "test" - the
> real culprit.
>
> I mean... how?!
>

Here are the five oom kills that occurred in your log, and notice that the
first four times it kills a child and not the actual task as I explained:

[97137.724971] Out of memory: kill process 21485 (VBoxSVC) score 1564940 or a child
[97137.725017] Killed process 21503 (VirtualBox)
[97137.864622] Out of memory: kill process 11141 (kdeinit4) score 1196178 or a child
[97137.864656] Killed process 11142 (klauncher)
[97137.888146] Out of memory: kill process 11141 (kdeinit4) score 1184308 or a child
[97137.888180] Killed process 11151 (ksmserver)
[97137.972875] Out of memory: kill process 11141 (kdeinit4) score 1146255 or a child
[97137.972888] Killed process 11224 (audacious2)

Those are practically happening simultaneously with very little memory
being available between each oom kill. Only later is "test" killed:

[97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
[97240.206832] Killed process 5005 (test)

Notice how the badness score is less than 1/4th of the others. So while
you may find it to be hogging a lot of memory, there were others that
consumed much more.

You can get a more detailed understanding of this by doing

echo 1 > /proc/sys/vm/oom_dump_tasks

before trying your testcase; it will show various information like the
total_vm and oom_adj value for each task at the time of oom (and the
actual badness score is exported per-task via /proc/pid/oom_score in
real-time). This will also include the rss and show what the end result
would be in using that value as part of the heuristic on this particular
workload compared to the current implementation.

2009-10-28 04:12:40

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 28 Oct 2009, KOSAKI Motohiro wrote:

> I agree quartering is debatable.
> At least, removing the compounded quartering is worthwhile for any user, and it could be pushed into -stable.
>

Not sure where the -stable reference came from, I don't think this is a
candidate.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ea2147d..40d323d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> /*
> * Superuser processes are usually more important, so we make it
> * less likely that we kill those.
> - */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> - has_capability_noaudit(p, CAP_SYS_RESOURCE))
> - points /= 4;
> -
> - /*
> - * We don't want to kill a process with direct hardware access.
> + *
> + * Plus, We don't want to kill a process with direct hardware access.
> * Not only could that mess up the hardware, but usually users
> * tend to only have this flag set on applications they think
> * of as important.
> */
> - if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> points /= 4;
>
> /*

Acked-by: David Rientjes <[email protected]>

2009-10-28 04:57:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009 21:08:56 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Wed, 28 Oct 2009, Vedran Furac wrote:
>
> > > This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> > > kills a child of the selected process instead if they do not share the
> > > same memory. The chosen task in that case is untouched.
> >
> > OK, I stand corrected then. Thanks! But, while testing this I lost X
> > once again and "test" survived for some time (check the timestamps):
> >
> > http://pastebin.com/d5c9d026e
> >
> > - It started by killing gkrellm(!!!)
> > - Then I lost X (kdeinit4 I guess)
> > - Then 103 seconds after the killing started, it killed "test" - the
> > real culprit.
> >
> > I mean... how?!
> >
>
> Here are the five oom kills that occurred in your log, and notice that the
> first four times it kills a child and not the actual task as I explained:
>
> [97137.724971] Out of memory: kill process 21485 (VBoxSVC) score 1564940 or a child
> [97137.725017] Killed process 21503 (VirtualBox)
> [97137.864622] Out of memory: kill process 11141 (kdeinit4) score 1196178 or a child
> [97137.864656] Killed process 11142 (klauncher)
> [97137.888146] Out of memory: kill process 11141 (kdeinit4) score 1184308 or a child
> [97137.888180] Killed process 11151 (ksmserver)
> [97137.972875] Out of memory: kill process 11141 (kdeinit4) score 1146255 or a child
> [97137.972888] Killed process 11224 (audacious2)
>
> Those are practically happening simultaneously with very little memory
> being available between each oom kill. Only later is "test" killed:
>
> [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> [97240.206832] Killed process 5005 (test)
>
> Notice how the badness score is less than 1/4th of the others. So while
> you may find it to be hogging a lot of memory, there were others that
> consumed much more.

This is not related to the child-parent problem.

Looking at these numbers more closely:
==
[97137.709272] Active_anon:671487 active_file:82 inactive_anon:132316
[97137.709273] inactive_file:82 unevictable:50 dirty:0 writeback:0 unstable:0
[97137.709273] free:6122 slab:17179 mapped:30661 pagetables:8052 bounce:0
==

active_file + inactive_file is very low; almost all pages are anon.
But "mapped" (NR_FILE_MAPPED) is a little high. This implies the remaining file
caches are mapped by many processes, or some megabytes of shmem are in use.

The number of page tables is 8052, which means
8052 pages x 512 PTEs/page x 4KB = ~16GB of mapped area.

Total available memory is close to active/inactive + slab:
671487+82+132316+82+50+6122+17179+8052 = 835370 pages x 4KB = ~3.2GB
(this system is swapless)

Then, considering the pmap output KOSAKI showed, I guess the killed processes
had big total_vm but not much real RSS, so killing them was no help for the
OOM situation.

Thanks,
-Kame

2009-10-28 05:13:51

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:

> This is not related to the child-parent problem.
>
> Looking at these numbers more closely:
> ==
> [97137.709272] Active_anon:671487 active_file:82 inactive_anon:132316
> [97137.709273] inactive_file:82 unevictable:50 dirty:0 writeback:0 unstable:0
> [97137.709273] free:6122 slab:17179 mapped:30661 pagetables:8052 bounce:0
> ==
>
> active_file + inactive_file is very low; almost all pages are anon.
> But "mapped" (NR_FILE_MAPPED) is a little high. This implies the remaining file
> caches are mapped by many processes, or some megabytes of shmem are in use.
>
> The number of page tables is 8052, which means
> 8052 pages x 512 PTEs/page x 4KB = ~16GB of mapped area.
>
> Total available memory is close to active/inactive + slab:
> 671487+82+132316+82+50+6122+17179+8052 = 835370 pages x 4KB = ~3.2GB
> (this system is swapless)
>

Yep:

[97137.724965] 917504 pages RAM
[97137.724967] 69721 pages reserved

(917504 - 69721) * 4K = ~3.23G

> Then, considering the pmap output KOSAKI showed, I guess the killed processes
> had big total_vm but not much real RSS, so killing them was no help for the
> OOM situation.
>

echo 1 > /proc/sys/vm/oom_dump_tasks can confirm that.

The bigger issue is making the distinction between killing a rogue task
that is using much more memory than expected (the supposed current
behavior, influenced from userspace by /proc/pid/oom_adj), and killing the
task with the highest rss. The latter is definitely desired if we are
allocating tons of memory but reduces the ability of the user to influence
the badness score.

2009-10-28 06:08:06

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009 22:13:44 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> Yep:
>
> [97137.724965] 917504 pages RAM
> [97137.724967] 69721 pages reserved
>
> (917504 - 69721) * 4K = ~3.23G
>
> > Then, considering the pmap kosaki shows,
> > I guess killed ones had big total_vm but has not much real rss,
> > and no helps for oom.
> >
>
> echo 1 > /proc/sys/vm/oom_dump_tasks can confirm that.
>
yes.

> The bigger issue is making the distinction between killing a rogue task
> that is using much more memory than expected (the supposed current
> behavior, influenced from userspace by /proc/pid/oom_adj), and killing the
> task with the highest rss.

All kernel engineers know that "more than expected or not" can never be known to the kernel.
So the oom_adj workaround is used now (by some special users).
The OOM killer itself is also a workaround.
"No kill" would be the best thing, but memory leakers tend to exist on bad
systems, and no system in this world is perfect.

In the kernel's view, there is no difference between a rogue process and the one
with the highest rss. As a heuristic, "time" is used now, but it's not very trustworthy.

> The latter is definitely desired if we are
> allocating tons of memory but reduces the ability of the user to influence
> the badness score.
>

Yes, some more trustworthy values than vmsize/rss/time would be appreciated.
I wonder whether the recent memory-consumption speed could be another key value.

Anyway, the current behavior of killing X is a bad thing.
We need some fixes.

Thanks,
-Kame

2009-10-28 06:17:46

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:

> All kernel engineers know that "more than expected or not" can never be known to the kernel.
> So the oom_adj workaround is used now (by some special users).
> The OOM killer itself is also a workaround.
> "No kill" would be the best thing, but memory leakers tend to exist on bad
> systems, and no system in this world is perfect.
>

Right, and historically that has been addressed by considering total_vm
and adjusting it with oom_adj so that we can identify memory leaking tasks
through user-defined criteria.

> Yes, some more trustworthy values than vmsize/rss/time would be appreciated.
> I wonder whether the recent memory-consumption speed could be another key value.
>

Sounds very logical.

> Anyway, the current behavior of killing X is a bad thing.
> We need some fixes.
>

You can easily protect X with OOM_DISABLE, as you know. I don't think we
need any X-specific heuristics added to the kernel, it looks like the
special cases have already polluted badness() enough.

2009-10-28 06:22:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009 23:17:41 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
> > All kernel engineers know that "more than expected or not" can never be known to the kernel.
> > So the oom_adj workaround is used now (by some special users).
> > The OOM killer itself is also a workaround.
> > "No kill" would be the best thing, but memory leakers tend to exist on bad
> > systems, and no system in this world is perfect.
> >
>
> Right, and historically that has been addressed by considering total_vm
> and adjusting it with oom_adj so that we can identify memory leaking tasks
> through user-defined criteria.
>
> > Yes, some more trustworthy values than vmsize/rss/time would be appreciated.
> > I wonder whether the recent memory-consumption speed could be another key value.
> >
>
> Sounds very logical.
>
> > Anyway, the current behavior of killing X is a bad thing.
> > We need some fixes.
> >
>
> You can easily protect X with OOM_DISABLE, as you know. I don't think we
> need any X-specific heuristics added to the kernel, it looks like the
> special cases have already polluted badness() enough.
>
It's _not_ special to X.

Almost all applications which use many dynamic libraries can be affected by this
total_vm issue. And, as I explained to Vedran, multi-threaded programs like Java
can easily increase total_vm without using much anon_rss.
That is the reason I hate overcommit_memory: the size of the VM doesn't tell you
anything.


Thanks,
-Kame

2009-10-28 08:10:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 27 Oct 2009, David Rientjes wrote:
>
> Not sure where the -stable reference came from, I don't think this is a
> candidate.

I agree with David, this is only one little piece of a messy puzzle,
there's no good reason to rush this into -stable.

> > + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> > + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> > + has_capability_noaudit(p, CAP_SYS_RAWIO))
>
> Acked-by: David Rientjes <[email protected]>

Acked-by: Hugh Dickins <[email protected]>

(as far as it goes: the whole thing of quartering badness here
because "we don't want to kill" and "important" is questionable;
but definitely much more open to argument both ways than sixteenthing).

2009-10-28 13:28:17

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> On Wed, 28 Oct 2009, Vedran Furac wrote:
>
>>> This is wrong; it doesn't "emulate oom" since oom_kill_process() always
>>> kills a child of the selected process instead if they do not share the
>>> same memory. The chosen task in that case is untouched.
>> OK, I stand corrected then. Thanks! But, while testing this I lost X
>> once again and "test" survived for some time (check the timestamps):
>>
>> http://pastebin.com/d5c9d026e
>>
>> - It started by killing gkrellm(!!!)
>> - Then I lost X (kdeinit4 I guess)
>> - Then 103 seconds after the killing started, it killed "test" - the
>> real culprit.
>>
>> I mean... how?!
>>
>
> Here are the five oom kills that occurred in your log, and notice that the
> first four times it kills a child and not the actual task as I explained:

Yes, but four times wrong.

> Those are practically happening simultaneously with very little memory
> being available between each oom kill. Only later is "test" killed:
>
> [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> [97240.206832] Killed process 5005 (test)
>
> Notice how the badness score is less than 1/4th of the others. So while
> you may find it to be hogging a lot of memory, there were others that
> consumed much more.
^^^^^^^^^^^^^^^^^^^^^

This is just wrong. I have 3.5GB of RAM, free says that 2GB are empty
(ignoring cache). Culprit then allocates all free memory (2GB). That
means it is using *more* than all other processes *together*. There
cannot be any other "that consumed much more".

> You can get a more detailed understanding of this by doing
>
> echo 1 > /proc/sys/vm/oom_dump_tasks
>
> before trying your testcase; it will show various information like the
> total_vm

Looking at total_vm (VIRT in top / vsize in ps?) is completely wrong. If I
sum up those numbers for every process running, I get:

%ps -eo pid,vsize,command|awk '{ SUM += $2} END {print SUM/1024/1024}'
14.7935

14GB. And I only have 3GB. I usually use exmap to get realistic numbers:

http://www.berthels.co.uk/exmap/doc.html

> and oom_adj value for each task at the time of oom (and the
> actual badness score is exported per-task via /proc/pid/oom_score in
> real-time). This will also include the rss and show what the end result
> would be in using that value as part of the heuristic on this particular
> workload compared to the current implementation.

Thanks, I'll try that... but I guess that using rss would yield better
results.


Regards,

Vedran

2009-10-28 20:10:47

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 28 Oct 2009, Vedran Furac wrote:

> > Those are practically happening simultaneously with very little memory
> > being available between each oom kill. Only later is "test" killed:
> >
> > [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> > [97240.206832] Killed process 5005 (test)
> >
> > Notice how the badness score is less than 1/4th of the others. So while
> > you may find it to be hogging a lot of memory, there were others that
> > consumed much more.
> ^^^^^^^^^^^^^^^^^^^^^
>
> This is just wrong. I have 3.5GB of RAM, free says that 2GB are empty
> (ignoring cache). Culprit then allocates all free memory (2GB). That
> means it is using *more* than all other processes *together*. There
> cannot be any other "that consumed much more".
>

Just post the oom killer results after using echo 1 >
/proc/sys/vm/oom_dump_tasks as requested and it will clarify why those
tasks were chosen to kill. It will also show the result of using rss
instead of total_vm and allow us to see how such a change would have
changed the killing order for your workload.

> Thanks, I'll try that... but I guess that using rss would yield better
> results.
>

We would know if you posted the data.

2009-10-29 03:05:47

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> We would know if you posted the data.

I need to find some free time to destroy a session on a computer which I
use for work. You could easily test it yourself also as this doesn't
happen only to me.

Anyways, here it is... this time it started with ntpd:

http://pastebin.com/f3f9674a0

Regards,

Vedran

2009-10-29 08:35:47

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Thu, 29 Oct 2009, Vedran Furac wrote:

> > We would know if you posted the data.
>
> I need to find some free time to destroy a session on a computer which I
> use for work. You could easily test it yourself also as this doesn't
> happen only to me.
>
> Anyways, here it is... this time it started with ntpd:
>
> http://pastebin.com/f3f9674a0
>

That oom log shows 12 ooms but no tasks actually appear to be getting
killed (there're no "Killed process 1234 (task)" found). Do you have any
idea why?

Anyway, as I posted in response to KAMEZAWA-san's patch, the change to
get_mm_rss(mm) prefers Xorg more than the current implementation.

From your log at the link above:

total_vm
669624 test
195695 krunner
187342 krusader
168881 plasma-desktop
130562 ktorrent
127081 knotify4
125881 icedove-bin
123036 akregator

rss
668738 test
42191 Xorg
30761 firefox-bin
13331 icedove-bin
10234 ktorrent
9263 akregator
8864 plasma-desktop
7532 krunner

Can you explain why Xorg is preferred as a baseline to kill rather than
krunner in your example?

Thanks.

2009-10-29 08:38:28

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:

> It's _not_ special to X.
>
> Almost all applications which use many dynamic libraries can be affected by this
> total_vm issue. And, as I explained to Vedran, a multi-threaded program like Java
> can easily increase total_vm without using much anon_rss.
> And that's the reason I hate overcommit_memory: the size of the VM doesn't tell you anything.
>

Right, because Vedran's latest oom log shows that with your patch Xorg is
preferred over every other task except the memory-hogging test program,
which it wasn't without it. I pointed out a clear distinction in the
killing order using both total_vm and rss in that log, and in my opinion
killing Xorg as opposed to krunner would be undesirable.

2009-10-29 11:01:48

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> On Thu, 29 Oct 2009, Vedran Furac wrote:
>
>>> We would know if you posted the data.
>> I need to find some free time to destroy a session on a computer which I
>> use for work. You could easily test it yourself also as this doesn't
>> happen only to me.
>>
>> Anyways, here it is... this time it started with ntpd:
>>
>> http://pastebin.com/f3f9674a0
>>
>
> That oom log shows 12 ooms but no tasks actually appear to be getting
> killed (there're no "Killed process 1234 (task)" found). Do you have any
> idea why?

That's /var/log/messages. I posted that rather than dmesg because the whole
log didn't fit in the dmesg buffer; here is what I have (compare timestamps):

% dmesg|grep -i kill

[ 1493.064458] Out of memory: kill process 6304 (kdeinit4) score 1190231
or a child
[ 1493.064467] Killed process 6409 (konqueror)
[ 1493.261149] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1493.261166] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1493.276528] Out of memory: kill process 6304 (kdeinit4) score 1161265
or a child
[ 1493.276538] Killed process 6411 (krusader)
[ 1499.221160] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1499.221178] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1499.236431] Out of memory: kill process 6304 (kdeinit4) score 1067593
or a child
[ 1499.236441] Killed process 6412 (irexec)
[ 1499.370192] firefox-bin invoked oom-killer: gfp_mask=0x201da,
order=0, oomkilladj=0
[ 1499.370209] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1499.385417] Out of memory: kill process 6304 (kdeinit4) score 1066861
or a child
[ 1499.385427] Killed process 6420 (xchm)
[ 1499.458304] kio_file invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1499.458333] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1499.458367] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
[ 1499.473573] Out of memory: kill process 6304 (kdeinit4) score 1043690
or a child
[ 1499.473582] Killed process 6425 (kio_file)
[ 1500.250746] korgac invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1500.250765] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1500.266186] Out of memory: kill process 6304 (kdeinit4) score 1020350
or a child
[ 1500.266196] Killed process 6464 (icedove)
[ 1500.349355] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1500.349371] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1500.364689] Out of memory: kill process 6304 (kdeinit4) score 1019864
or a child
[ 1500.364699] Killed process 6477 (kio_http)
[ 1500.452151] kded4 invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1500.452167] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1500.452196] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
[ 1500.467307] Out of memory: kill process 6304 (kdeinit4) score 993142
or a child
[ 1500.467316] Killed process 6478 (kio_http)
[ 1500.780222] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1500.780239] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1500.796280] Out of memory: kill process 6304 (kdeinit4) score 966331
or a child
[ 1500.796290] Killed process 6484 (kio_http)
[ 1501.065374] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1501.065390] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1501.080579] Out of memory: kill process 6304 (kdeinit4) score 939434
or a child
[ 1501.080587] Killed process 6486 (kio_http)
[ 1501.381188] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1501.381204] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1501.396338] Out of memory: kill process 6304 (kdeinit4) score 912691
or a child
[ 1501.396346] Killed process 6487 (firefox-bin)
[ 1502.661294] icedove-bin invoked oom-killer: gfp_mask=0x201da,
order=0, oomkilladj=0
[ 1502.661311] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1502.676563] Out of memory: kill process 7580 (test) score 708945 or a
child
[ 1502.676575] Killed process 7580 (test)


> Can you explain why Xorg is preferred as a baseline to kill rather than
> krunner in your example?

Krunner is a small app for running other apps and doing similar things. It
shouldn't use a lot of memory. OTOH, Xorg has to hold all the pixmaps
and so on. That was the expected result: first Xorg, then firefox and
thunderbird.

2009-10-29 11:11:31

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> Right, because in Vedran's latest oom log it shows that Xorg is preferred
> more than any other thread other than the memory hogging test program with
> your patch than without. I pointed out a clear distinction in the killing
> order using both total_vm and rss in that log and in my opinion killing
> Xorg as opposed to krunner would be undesireable.

But then you should rename OOM killer to TRIPK:
Totally Random Innocent Process Killer

If you have OOM situation and Xorg is the first, that means it's leaking
memory badly and the system is probably already frozen/FUBAR. Killing
krunner in that situation wouldn't do any good. From a user perspective,
nothing changes, system is still FUBAR and (s)he would probably reboot
cursing linux in the process.

2009-10-29 19:42:50

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Thu, 29 Oct 2009, Vedran Furac wrote:

> [ 1493.064458] Out of memory: kill process 6304 (kdeinit4) score 1190231
> or a child
> [ 1493.064467] Killed process 6409 (konqueror)
> [ 1493.261149] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1493.261166] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1493.276528] Out of memory: kill process 6304 (kdeinit4) score 1161265
> or a child
> [ 1493.276538] Killed process 6411 (krusader)
> [ 1499.221160] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1499.221178] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.236431] Out of memory: kill process 6304 (kdeinit4) score 1067593
> or a child
> [ 1499.236441] Killed process 6412 (irexec)
> [ 1499.370192] firefox-bin invoked oom-killer: gfp_mask=0x201da,
> order=0, oomkilladj=0
> [ 1499.370209] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.385417] Out of memory: kill process 6304 (kdeinit4) score 1066861
> or a child
> [ 1499.385427] Killed process 6420 (xchm)
> [ 1499.458304] kio_file invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1499.458333] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.458367] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
> [ 1499.473573] Out of memory: kill process 6304 (kdeinit4) score 1043690
> or a child
> [ 1499.473582] Killed process 6425 (kio_file)
> [ 1500.250746] korgac invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.250765] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.266186] Out of memory: kill process 6304 (kdeinit4) score 1020350
> or a child
> [ 1500.266196] Killed process 6464 (icedove)
> [ 1500.349355] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.349371] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.364689] Out of memory: kill process 6304 (kdeinit4) score 1019864
> or a child
> [ 1500.364699] Killed process 6477 (kio_http)
> [ 1500.452151] kded4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.452167] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.452196] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
> [ 1500.467307] Out of memory: kill process 6304 (kdeinit4) score 993142
> or a child
> [ 1500.467316] Killed process 6478 (kio_http)
> [ 1500.780222] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.780239] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.796280] Out of memory: kill process 6304 (kdeinit4) score 966331
> or a child
> [ 1500.796290] Killed process 6484 (kio_http)
> [ 1501.065374] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1501.065390] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1501.080579] Out of memory: kill process 6304 (kdeinit4) score 939434
> or a child
> [ 1501.080587] Killed process 6486 (kio_http)
> [ 1501.381188] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1501.381204] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1501.396338] Out of memory: kill process 6304 (kdeinit4) score 912691
> or a child
> [ 1501.396346] Killed process 6487 (firefox-bin)
> [ 1502.661294] icedove-bin invoked oom-killer: gfp_mask=0x201da,
> order=0, oomkilladj=0
> [ 1502.661311] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1502.676563] Out of memory: kill process 7580 (test) score 708945 or a
> child
> [ 1502.676575] Killed process 7580 (test)
>

Ok, so this is the forkbomb problem caused by adding half of each child's
total_vm into the badness score of the parent. We should address this
completely separately by addressing that specific part of the heuristic,
not by changing what we consider to be a baseline.

The rationale is quite simple: we'll still experience the same problem
with rss as we did with total_vm in the forkbomb scenario above on certain
workloads (maybe not yours, but others). The oom killer always kills a
child first if it has a different mm than the selected parent, so the
amount of memory freed as a result is entirely dependent on the order of
the child list. It may be very little, yet the child is killed because its
siblings had large total_vm values.

So instead of focusing on rss, we simply need to find a better heuristic
for the forkbomb issue which I've already proposed a very trivial solution
for. Then, afterwards, we can debate about how the scoring heuristic can
be changed to select better tasks (and perhaps remove a lot of the clutter
that's there currently!).

> > Can you explain why Xorg is preferred as a baseline to kill rather than
> > krunner in your example?
>
> Krunner is a small app for running other apps and doing similar things. It
> shouldn't use a lot of memory. OTOH, Xorg has to hold all the pixmaps
> and so on. That was the expected result: first Xorg, then firefox and
> thunderbird.
>

You're making all these claims and assertions based _solely_ on the theory
that killing the application with the most resident RAM is always the
optimal solution. That's just not true, especially if we're just
allocating small numbers of order-0 memory.

Much better is to allow the user to decide at what point, regardless of
swap usage, their application is using much more memory than expected or
required. They can do that right now pretty well with /proc/pid/oom_adj
without this outlandish claim that they should be expected to know the rss
of their applications at the time of oom to effectively tune oom_adj.

What would you suggest? A script that sits in a loop checking each task's
current rss from /proc/pid/stat or their current oom priority through
/proc/pid/oom_score and adjusting oom_adj preemptively just in case the
oom killer is invoked in the next second?

And that "small app" has 30MB of rss which could be freed, if killed, and
utilized for subsequent page allocations.

2009-10-29 19:53:54

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Thu, 29 Oct 2009, Vedran Furac wrote:

> But then you should rename OOM killer to TRIPK:
> Totally Random Innocent Process Killer
>

The randomness here is the order of the child list when the oom killer
selects a task, based on the badness score, and then tries to kill a child
with a different mm before the parent.

The problem you identified in http://pastebin.com/f3f9674a0, however, is a
forkbomb issue where the badness score should never have been so high for
kdeinit4 compared to "test". That's a direct result of adding all the
disjoint child total_vm values into the badness score for the parent and
then killing the children instead.

That's the problem, not using total_vm as a baseline. Replacing that with
rss is not going to solve the issue and reducing the user's ability to
specify a rough oom priority from userspace is simply not an option.

> If you have OOM situation and Xorg is the first, that means it's leaking
> memory badly and the system is probably already frozen/FUBAR. Killing
> krunner in that situation wouldn't do any good. From a user perspective,
> nothing changes, system is still FUBAR and (s)he would probably reboot
> cursing linux in the process.
>

It depends on what you're running; we need to be able to have the option
of protecting very large tasks on production servers. Imagine if "test"
here is actually a critical application that we need to protect: its
memory is not solely mlocked anonymous memory, but we still want it killed
if it is leaking memory beyond your approximate 2.5GB. How do you do that
when using rss as the baseline?

2009-10-29 23:51:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Thu, 29 Oct 2009 12:53:42 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> > If you have OOM situation and Xorg is the first, that means it's leaking
> > memory badly and the system is probably already frozen/FUBAR. Killing
> > krunner in that situation wouldn't do any good. From a user perspective,
> > nothing changes, system is still FUBAR and (s)he would probably reboot
> > cursing linux in the process.
> >
>
> It depends on what you're running, we need to be able to have the option
> of protecting very large tasks on production servers. Imagine if "test"
> here is actually a critical application that we need to protect, its
> not solely mlocked anonymous memory, but still kill if it is leaking
> memory beyond your approximate 2.5GB. How do you do that when using rss
> as the baseline?

As I wrote repeatedly,

- The OOM-Killer itself is a bad thing, a bad situation.
- The kernel can't know whether the program is bad or not; it just guesses.
- Then, there is no "correct" OOM-Killer other than a fork-bomb killer.
- The user has a knob, oom_adj. This is very strong.

Then, there is only a "reasonable" or "easy-to-understand" OOM kill.
"The current biggest memory eater is killed" sounds reasonable and easy to
understand. And if total_vm worked well, overcommit_guess should catch it.
Please improve overcommit_guess if you want to stay with total_vm.


Thanks,
-Kame

2009-10-30 09:10:46

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:

> As I wrote repeatedly,
>
> - The OOM-Killer itself is a bad thing, a bad situation.

Not necessarily; the memory controller and cpusets use it quite often to
enforce their policies, and it is standard runtime behavior. We'd like to
imagine that our cpuset will never be too small to run all the attached
jobs, but that happens and we can easily recover from it by killing a
task.

> - The kernel can't know whether the program is bad or not; it just guesses.

Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
can tell the kernel what we'd like the oom killer behavior to be if
the situation arises.

> - Then, there is no "correct" OOM-Killer other than a fork-bomb killer.

Well of course there is; you're seeing this in a WAY too simplistic
manner. If we are oom, we want to be able to influence how the oom killer
behaves and respond to that situation. You are proposing that we change
the baseline for how the oom killer selects tasks, which we use CONSTANTLY
as part of our normal production environment. I'd appreciate it if you'd
take it a little more seriously.

> - The user has a knob, oom_adj. This is very strong.
>

Agreed.

> Then, there is only a "reasonable" or "easy-to-understand" OOM kill.
> "The current biggest memory eater is killed" sounds reasonable and easy to
> understand. And if total_vm worked well, overcommit_guess should catch it.
> Please improve overcommit_guess if you want to stay with total_vm.
>

I don't necessarily want to stay on total_vm, but I also don't want to
move to rss as a baseline, as you would probably agree.

We disagree about a very fundamental principle: you are coming from a
perspective of always wanting to kill the biggest resident memory eater
even for a single order-0 allocation that fails and I'm coming from a
perspective of wanting to ensure that our machines know how the oom killer
will react when it is used. Moving to rss reduces the ability of the user
to specify an expected oom priority other than polarizing it by either
disabling it completely with an oom_adj value of -17 or choosing the
definite next victim with +15. That's my objection to it: the user cannot
possibly be expected to predict what proportion of each application's
memory will be resident at the time of oom.

I understand you want to totally rewrite the oom killer for whatever
reason, but I think you need to spend a lot more time understanding the
needs that the Linux community has for its behavior instead of insisting
on your point of view.

2009-10-30 09:39:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri, 30 Oct 2009 02:10:37 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> > - The kernel can't know whether the program is bad or not; it just guesses.
>
> Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> can tell the kernel what we'd like the oom killer behavior to be if
> the situation arises.
>

My point is that the server cannot distinguish a memory leak from
intentional memory usage. Nothing more than that.



> > - Then, there is no "correct" OOM-Killer other than a fork-bomb killer.
>
> Well of course there is; you're seeing this in a WAY too simplistic
> manner. If we are oom, we want to be able to influence how the oom killer
> behaves and respond to that situation. You are proposing that we change
> the baseline for how the oom killer selects tasks, which we use CONSTANTLY
> as part of our normal production environment. I'd appreciate it if you'd
> take it a little more seriously.
>
Yes, I'm serious.

This summer, at lunch with an everyday linux user, I was told:
"you enterprise guys don't consider desktop or laptop problems at all."
Yes, I use only servers. My customers use servers, too. My first priority
is always server users.
But this time, I wrote a reply to Vedran to try to fix the desktop problem.
Even if the current logic works well for servers, the "KDE/GNOME is killed"
problem seems to be serious. And this may be a problem for embedded people, I guess.


> > - The user has a knob, oom_adj. This is very strong.
> >
>
> Agreed.
>
This and memcg are very useful. But everyone says "bad workaround" ;(
Maybe only servers can use these functions.

> > Then, there is only a "reasonable" or "easy-to-understand" OOM kill.
> > "The current biggest memory eater is killed" sounds reasonable and easy to
> > understand. And if total_vm worked well, overcommit_guess should catch it.
> > Please improve overcommit_guess if you want to stay with total_vm.
> >
>
> I don't necessarily want to stay on total_vm, but I also don't want to
> move to rss as a baseline, as you would probably agree.
>
I'll rewrite it all. I'll not rely only on rss. There are several situations,
and we need more information than we have now. I'll have to implement
ways to gather information before changing badness.


> We disagree about a very fundamental principle: you are coming from a
> perspective of always wanting to kill the biggest resident memory eater
> even for a single order-0 allocation that fails and I'm coming from a
> perspective of wanting to ensure that our machines know how the oom killer
> will react when it is used.
yes.

> Moving to rss reduces the ability of the user to specify an expected oom
> priority other than polarizing it by either
> disabling it completely with an oom_adj value of -17 or choosing the
> definite next victim with +15. That's my objection to it: the user cannot
> possibly be expected to predict what proportion of each application's
> memory will be resident at the time of oom.
>
I can say the same thing about total_vm size. total_vm doesn't carry any
good information for the oom situation, and tweaking based on that
not-useful parameter will make things worse.

For the oom_adj tweak, we may need a technique other than "shift".
If I were writing oom_adj, I'd write it as

/proc/<pid>/guarantee_nooom_size

#echo 3G > /proc/<pid>/guarantee_nooom_size

Then, 3GB of this process's memory usage would not be accounted into badness.

I'm not sure whether I can add a new interface or replace oom_adj now.
But to do this, the current children's score problem etc. should be fixed first.

> I understand you want to totally rewrite the oom killer for whatever
> reason, but I think you need to spend a lot more time understanding the
> needs that the Linux community has for its behavior instead of insisting
> on your point of view.
>
Yes, I'll use more time. I don't think all of these changes can be done quickly.

To be honest, this is part of the work to implement a "custom oom handler" cgroup.
Before going further, I'd like to fix the current problems.

Thanks,
-Kame

2009-10-30 10:49:15

by Thomas Fjellstrom

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri October 30 2009, KAMEZAWA Hiroyuki wrote:
> On Fri, 30 Oct 2009 02:10:37 -0700 (PDT)
>
> David Rientjes <[email protected]> wrote:
> > > - The kernel can't know whether the program is bad or not; it just guesses.
> >
> > Totally irrelevant, given your fourth point about /proc/pid/oom_adj.
> > We can tell the kernel what we'd like the oom killer behavior to be
> > if the situation arises.
>
> My point is that the server cannot distinguish a memory leak from
> intentional memory usage. Nothing more than that.
>
> > > - Then, there is no "correct" OOM-Killer other than a fork-bomb
> > > killer.
> >
> > Well of course there is; you're seeing this in a WAY too simplistic
> > manner. If we are oom, we want to be able to influence how the oom
> > killer behaves and respond to that situation. You are proposing that
> > we change the baseline for how the oom killer selects tasks, which we
> > use CONSTANTLY as part of our normal production environment. I'd
> > appreciate it if you'd take it a little more seriously.
>
> Yes, I'm serious.
>
> This summer, at lunch with an everyday linux user, I was told:
> "you enterprise guys don't consider desktop or laptop problems at all."
> Yes, I use only servers. My customers use servers, too. My first priority
> is always server users.
> But this time, I wrote a reply to Vedran to try to fix the desktop
> problem. Even if the current logic works well for servers, the
> "KDE/GNOME is killed" problem seems to be serious. And this may be a
> problem for embedded people, I guess.

What's worse is that a friend of mine gets stuck with a useless machine for
a couple of hours or more when oom tries to do its thing. It swap-storms
for hours. Not a good thing imo.

[snip]


--
Thomas Fjellstrom
[email protected]

2009-10-30 13:53:34

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> Ok, so this is the forkbomb problem caused by adding half of each child's
> total_vm into the badness score of the parent. We should address this
> completely separately by addressing that specific part of the heuristic,
> not by changing what we consider to be a baseline.
>
> You're making all these claims and assertions based _solely_ on the theory
> that killing the application with the most resident RAM is always the
> optimal solution. That's just not true, especially if we're just
> allocating small numbers of order-0 memory.

Well, you are a kernel hacker, not me. You know how linux mm works much
better than I do. I just reported what I think is a big problem, which
needs to be solved ASAP (2.6.33). I'm afraid that we'll just talk a lot
and nothing will be done, with a solution/fix postponed indefinitely. Not
sure if you are interested, but I tested this on Windows XP as well, and
nothing bad happens there; the system continues to function properly.

For 2-3 years I had memory overcommit turned off. I didn't get any OOMs,
but sometimes Java didn't work, and it seems that because of some kernel
weirdness (or a misunderstanding on my part) I couldn't use all the
available memory:

# echo 2 > /proc/sys/vm/overcommit_memory

# echo 95 > /proc/sys/vm/overcommit_ratio
% ./test /* malloc in loop as before */
malloc: Cannot allocate memory /* Great, no OOM, but: */

% free -m
total used free shared buffers cached
Mem: 3458 3429 29 0 102 1119
-/+ buffers/cache: 2207 1251

There's plenty of memory available. Shouldn't cache be automatically
dropped (this question was in my original mail, hence the subject)?

All this frustrated not only me, but a great number of users on our
local Croatian linux usenet newsgroup, with some of them pointing to this
as the reason they use Solaris. And so on...

> Much better is to allow the user to decide at what point, regardless of
> swap usage, their application is using much more memory than expected or
> required. They can do that right now pretty well with /proc/pid/oom_adj
> without this outlandish claim that they should be expected to know the rss
> of their applications at the time of oom to effectively tune oom_adj.

Believe me, barely any developers use oom_adj for their applications,
and probably almost none of the end users do. What should they do? Every
time they start an application, go to a console and set oom_adj? You
cannot expect them to do that.

> What would you suggest? A script that sits in a loop checking each task's
> current rss from /proc/pid/stat or their current oom priority through
> /proc/pid/oom_score and adjusting oom_adj preemptively just in case the
> oom killer is invoked in the next second?

:)

2009-10-30 13:59:36

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> On Thu, 29 Oct 2009, Vedran Furac wrote:
>
>> But then you should rename OOM killer to TRIPK:
>> Totally Random Innocent Process Killer
>>
>
> The randomness here is the order of the child list when the oom killer
> selects a task, based on the badness score, and then tries to kill a child
> with a different mm before the parent.
>
> The problem you identified in http://pastebin.com/f3f9674a0, however, is a
> forkbomb issue where the badness score should never have been so high for
> kdeinit4 compared to "test". That's a direct result of adding all the
> disjoint child total_vm values into the badness score for the parent and
> then killing the children instead.

Could you explain to me why ntpd invoked the oom killer? Its parent is
init. Or syslog-ng?

> That's the problem, not using total_vm as a baseline. Replacing that with
> rss is not going to solve the issue and reducing the user's ability to
> specify a rough oom priority from userspace is simply not an option.

OK then, if you have a solution, I would be glad to test your patch. I
won't care much if you don't change total_vm as a baseline. Just make
random killing history.

Regards,

Vedran

2009-10-30 14:08:44

by Thomas Fjellstrom

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri October 30 2009, Vedran Furač wrote:
> David Rientjes wrote:
> > Ok, so this is the forkbomb problem caused by adding half of each
> > child's total_vm into the badness score of the parent. We should
> > address this completely separately by addressing that specific part of
> > the heuristic, not by changing what we consider to be a baseline.
> >
> > You're making all these claims and assertions based _solely_ on the
> > theory that killing the application with the most resident RAM is
> > always the optimal solution. That's just not true, especially if we're
> > just allocating small numbers of order-0 memory.
>
> Well, you are a kernel hacker, not me. You know how linux mm works much
> better than I do. I just reported what I think is a big problem, which
> needs to be solved ASAP (2.6.33). I'm afraid that we'll just talk a lot
> and nothing will be done, with a solution/fix postponed indefinitely. Not
> sure if you are interested, but I tested this on Windows XP as well, and
> nothing bad happens there; the system continues to function properly.
>
> For 2-3 years I had memory overcommit turned off. I didn't get any OOMs,
> but sometimes Java didn't work, and it seems that because of some kernel
> weirdness (or a misunderstanding on my part) I couldn't use all the
> available memory:
>
> # echo 2 > /proc/sys/vm/overcommit_memory
>
> # echo 95 > /proc/sys/vm/overcommit_ratio
> % ./test /* malloc in loop as before */
> malloc: Cannot allocate memory /* Great, no OOM, but: */
>
> % free -m
> total used free shared buffers cached
> Mem: 3458 3429 29 0 102 1119
> -/+ buffers/cache: 2207 1251
>
> There's plenty of memory available. Shouldn't cache be automatically
> dropped (this question was in my original mail, hence the subject)?
>
> All this frustrated not only me, but a great number of users on our
> local Croatian linux usenet newsgroup, with some of them pointing to this
> as the reason they use Solaris. And so on...

I think this is the MOST serious issue related to the oom killer. For some
reason it refuses to drop pages before trying to kill, when it should drop
cache first, THEN kill if needed.

> > Much better is to allow the user to decide at what point, regardless of
> > swap usage, their application is using much more memory than expected
> > or required. They can do that right now pretty well with
> > /proc/pid/oom_adj without this outlandish claim that they should be
> > expected to know the rss of their applications at the time of oom to
> > effectively tune oom_adj.
>
> Believe me, barely any developers use oom_adj for their applications,
> and probably almost none of the end users do. What should they do? Every
> time they start an application, go to a console and set oom_adj? You
> cannot expect them to do that.
>
> > What would you suggest? A script that sits in a loop checking each
> > task's current rss from /proc/pid/stat or their current oom priority
> > through /proc/pid/oom_score and adjusting oom_adj preemptively just in
> > case the oom killer is invoked in the next second?
> >
> :)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>


--
Thomas Fjellstrom
[email protected]

2009-10-30 14:13:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri, Oct 30, 2009 at 02:53:33PM +0100, Vedran Furač wrote:
> % free -m
> total used free shared buffers cached
> Mem: 3458 3429 29 0 102 1119
> -/+ buffers/cache: 2207 1251
>
> There's plenty of memory available. Shouldn't cache be automatically
> dropped (this question was in my original mail, hence the subject)?

This is not about cache; the cache amount is physical. This is about the
virtual amount that can only go in ram or swap (at any later time, the
current time is irrelevant) vs "ram + swap". In short, add more swap if
you don't like overcommit, and check grep Commit /proc/meminfo in case
this is an accounting bug...

2009-10-30 14:41:13

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

Andrea Arcangeli wrote:

> On Fri, Oct 30, 2009 at 02:53:33PM +0100, Vedran Furač wrote:
>> % free -m
>> total used free shared buffers cached
>> Mem: 3458 3429 29 0 102 1119
>> -/+ buffers/cache: 2207 1251
>>
>> There's plenty of memory available. Shouldn't cache be automatically
>> dropped (this question was in my original mail, hence the subject)?
>
> This is not about cache; the cache amount is physical. This is about the
> virtual amount that can only go in ram or swap (at any later time, the
> current time is irrelevant) vs "ram + swap".

Oh... so this is because apps "reserve" (Committed_AS?) more than they
currently need.

> In short, add more swap if
> you don't like overcommit, and check grep Commit /proc/meminfo in case
> this is an accounting bug...

At the time of "malloc: Cannot allocate memory":

CommitLimit: 3364440 kB
Committed_AS: 3240200 kB

So probably everything is ok (and free is misleading). Overcommit is
unfortunately necessary if I want to be able to use all my memory.

Btw. http://www.redhat.com/advice/tips/meminfo.html says Committed_AS is
a (gu)estimate. Hope it is a good (not too high) guesstimate. :)

Regards,

Vedran

2009-10-30 15:13:23

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

Thomas Fjellstrom wrote:

>> malloc: Cannot allocate memory /* Great, no OOM, but: */
>>
>> % free -m
>> total used free shared buffers cached
>> Mem: 3458 3429 29 0 102 1119
>> -/+ buffers/cache: 2207 1251
>>
>> There's plenty of memory available. Shouldn't cache be
>> automatically dropped (this question was in my original mail, hence
>> the subject)?
>>
>
> I think this is the MOST serious issue related to the oom killer. For
> some reason it refuses to drop pages before trying to kill. When it
> should drop cache, THEN kill if needed.

This isn't about OOM, but the situation when you turn off overcommit. I
was jumping to conclusions here. You can drop caches manually with:
# echo 1 > /proc/sys/vm/drop_caches

but you still get: "malloc: Cannot allocate memory" even if almost
nothing is cached:

total used free shared buffers cached
Mem: 3458 2210 1248 0 3 90
-/+ buffers/cache: 2116 1342

As for the kernel not dropping pages before killing, I don't know
anything about it. It happens so fast that I never tried to measure it.

2009-10-30 15:16:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri, Oct 30, 2009 at 03:41:12PM +0100, Vedran Furač wrote:
> Oh... so this is because apps "reserve" (Committed_AS?) more than they
> currently need.

They don't actually reserve; they end up "reserving" if overcommit is
set to 2 (OVERCOMMIT_NEVER)... Apps aren't reserving; more likely they
simply avoid a flood of mmap calls when a single one is enough to map a
huge MAP_PRIVATE region like shared libs that you may only execute
partially (this is why total_vm is usually much bigger than the real ram
mapped by pagetables, represented in rss). But those shared libs are
99% pageable and they don't need to stay in swap or ram, so the
overcommit accounting greatly overestimates the actual needs even if
shared lib loading weren't 64bit optimized (i.e. large and a single one).

> At the time of "malloc: Cannot allocate memory":
>
> CommitLimit: 3364440 kB
> Committed_AS: 3240200 kB
>
> So probably everything is ok (and free is misleading). Overcommit is
> unfortunately necessary if I want to be able to use all my memory.

Add more swap.

> Btw. http://www.redhat.com/advice/tips/meminfo.html says Committed_AS is
> a (gu)estimate. Hope it is a good (not too high) guesstimate. :)

It is a guess in the sense that, to guarantee no ENOMEM, it has to take
into account the worst possible case, that is, all shared lib MAP_PRIVATE
mappings being cowed, which is very far from reality. Other than that,
the overcommit accounting should exactly match all mmapped, possibly
writeable space that can only fit in ram+swap, so from that point of view
it's not a guessed number (modulo the smp read out of order). The only
guess is how much slab, cache and other stuff is freeable, which
doesn't provide true perfection to OVERCOMMIT_NEVER.

2009-10-30 16:24:32

by Hugh Dickins

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri, 30 Oct 2009, Andrea Arcangeli wrote:
>
> It is a guess in the sense that, to guarantee no ENOMEM, it has to take
> into account the worst possible case, that is, all shared lib MAP_PRIVATE
> mappings being cowed, which is very far from reality.

A MAP_PRIVATE area is only counted into Committed_AS when it is or
has in the past been PROT_WRITE. I think it's up to the ELF header
of the shared library whether a section is PROT_WRITE or not; but it
looks like many are not, so Committed_AS should be (a little) nearer
reality than you fear.

Though we do account for Committed_AS, even while allowing overcommit,
we do not at present account for Committed_AS per mm. Seeing David
and KAMEZAWA-san debating over total_vm versus rss versus anon_rss,
I wonder whether such a "commit" count might be a better measure for
OOM choices (but shmem is as usual awkward: though accounted just once
in Committed_AS, it would probably have to be accounted to every mm
that maps it). Just an idea to throw into the mix.

Hugh

2009-10-30 19:24:16

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri, 30 Oct 2009, Vedran Furac wrote:

> > The problem you identified in http://pastebin.com/f3f9674a0, however, is a
> > forkbomb issue where the badness score should never have been so high for
> > kdeinit4 compared to "test". That's directly proportional to adding the
> > scores of all disjoint child total_vm values into the badness score for
> > the parent and then killing the children instead.
>
> Could you explain to me why ntpd invoked the oom killer? Its parent is
> init. Or syslog-ng?
>

Because it attempted an order-0 GFP_USER allocation and direct reclaim
could not free any pages.

The task that invoked the oom killer is simply the unlucky task that tried
an allocation that couldn't be satisfied through direct reclaim. It's
usually unrelated to the task chosen for kill unless
/proc/sys/vm/oom_kill_allocating_task is enabled (which SGI requested to
avoid excessively long tasklist scans).

> > That's the problem, not using total_vm as a baseline. Replacing that with
> > rss is not going to solve the issue and reducing the user's ability to
> > specify a rough oom priority from userspace is simply not an option.
>
> OK then, if you have a solution, I would be glad to test your patch. I
> won't care much if you don't change total_vm as a baseline. Just make
> random killing history.
>

The only randomness is in selecting a task that has a different mm from
the parent in the order of its child list. Yes, that can be addressed by
doing a smarter iteration through the children before killing one of them.

Keep in mind that a heuristic as simple as this:

- kill the task that was started most recently by the same uid, or

- kill the task that was started most recently on the system if a root
task calls the oom killer,

would have yielded perfect results for your testcase but isn't necessarily
something that we'd ever want to see.

2009-10-30 19:44:21

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri, 30 Oct 2009, Vedran Furac wrote:

> Well, you are a kernel hacker, not me. You know how linux mm works much
> more than I do. I just reported what I think is a big problem, which
> needs to be solved ASAP (2.6.33).

The oom killer heuristics have not been changed recently, why is this
suddenly a problem that needs to be immediately addressed? The heuristics
you've been referring to have been used for at least three years.

> I'm afraid that we'll just talk much
> and nothing will be done with solution/fix postponed indefinitely. Not
> sure if you are interested, but I tested this on windowsxp also, and
> nothing bad happens there, system continues to function properly.
>

I'm totally sympathetic to testcases such as your own where the oom killer
seems to react in an undesirable way. I agree that it could do a much
better job at targeting "test" and killing it without negatively impacting
other tasks.

However, I don't think we can simply change the baseline (like the rss
change which has been added to -mm (??)) and consider it a major
improvement when it severely impacts how system administrators are able to
tune the badness heuristic from userspace via /proc/pid/oom_adj. I'm sure
you'd agree that user input is important in this matter and that we
should maximize that ability rather than make it more difficult. That's
my main criticism of the suggestions thus far (and, sorry, but I have to
look out for production server interests here: you can't take away our
ability to influence oom badness scoring just because other simple
heuristics may be more understandable).

> > Much better is to allow the user to decide at what point, regardless of
> > swap usage, their application is using much more memory than expected or
> > required. They can do that right now pretty well with /proc/pid/oom_adj
> > without this outlandish claim that they should be expected to know the rss
> > of their applications at the time of oom to effectively tune oom_adj.
>
> Believe me, barely a few developers use oom_adj for their applications,
> and probably almost none of the end users. What should they do: every
> time they start an application, go to the console and set oom_adj? You
> cannot expect them to do that.
>

oom_adj is an extremely important part of our infrastructure and although
the majority of Linux users may not use it (I know a number of opensource
programs that tune their own, however), we can't let go of our ability to
specify an oom killing priority.

There are no simple solutions to this problem: the model proposed thus
far, which has basically been to acknowledge that oom killer is a bad
thing to encounter (but within that, some rationale was found that we can
react however we want??) and should be extremely easy to understand (just
kill the memory hogger with the most resident RAM) is a non-starter.

What would be better, and what I think we'll end up with, is a root
selectable heuristic so that production servers and desktop machines can
use different heuristics to make oom kill selections. We already have
/proc/sys/vm/oom_kill_allocating_task which I added 1-2 years ago to
address concerns specifically of SGI and their enormously long tasklist
scans. This would be a variation on that idea and would include different
simplistic behaviors (such as always killing the most memory hogging task,
killing the most recently started task by the same uid, etc), and leave
the default heuristic much the same as currently.

2009-11-02 19:56:24

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

Andrea Arcangeli wrote:

> On Fri, Oct 30, 2009 at 03:41:12PM +0100, Vedran Furač wrote:
>> Oh... so this is because apps "reserve" (Committed_AS?) more than they
>> currently need.
>
> They don't actually reserve; they end up "reserving" if overcommit is
> set to 2 (OVERCOMMIT_NEVER)... Apps aren't reserving; more likely they
> simply avoid a flood of mmap calls when a single one is enough to map a
> huge MAP_PRIVATE region like shared libs that you may only execute
> partially (this is why total_vm is usually much bigger than the real ram
> mapped by pagetables, represented in rss). But those shared libs are
> 99% pageable and they don't need to stay in swap or ram, so the
> overcommit accounting greatly overestimates the actual needs even if
> shared lib loading weren't 64bit optimized (i.e. large and a single one).

Thanks for info!

>> At the time of "malloc: Cannot allocate memory":
>>
>> CommitLimit: 3364440 kB
>> Committed_AS: 3240200 kB
>>
>> So probably everything is ok (and free is misleading). Overcommit is
>> unfortunately necessary if I want to be able to use all my memory.
>
> Add more swap.

I don't use swap. With current prices of RAM, swap is history, at least
for desktops. I hate it when e.g. firefox gets swapped out if I don't use
it for a while. Removing swap decreased desktop latencies drastically.
And I don't care much if I lose 100MB of potential free memory that
could be used for disk cache...

Regards.

Vedran

2009-11-02 19:56:44

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> On Fri, 30 Oct 2009, Vedran Furac wrote:
>
>> Well, you are kernel hacker, not me. You know how linux mm works much
>> more than I do. I just reported a, what I think is a big problem, which
>> needs to be solved ASAP (2.6.33).
>
> The oom killer heuristics have not been changed recently, why is this
> suddenly a problem that needs to be immediately addressed? The heuristics
> you've been referring to have been used for at least three years.

It isn't "suddenly a problem"; it's just a problem, a big, long-standing
problem. If it is three years old, then it should have been addressed
asap three years ago (and we would not need to talk about it now,
hopefully).

> However, I don't think we can simply change the baseline (like the rss
> change which has been added to -mm (??)) and consider it a major
> improvement when it severely impacts how system administrators are able to
> tune the badness heuristic from userspace via /proc/pid/oom_adj. I'm sure
> you'd agree that user input is important in this matter and that we
> should maximize that ability rather than make it more difficult. That's
> my main criticism of the suggestions thus far (and, sorry, but I have to
> look out for production server interests here: you can't take away our
> ability to influence oom badness scoring just because other simple
> heuristics may be more understandable).
>
> What would be better, and what I think we'll end up with, is a root
> selectable heuristic so that production servers and desktop machines can
> use different heuristics to make oom kill selections. We already have
> /proc/sys/vm/oom_kill_allocating_task which I added 1-2 years ago to
> address concerns specifically of SGI and their enormously long tasklist
> scans. This would be a variation on that idea and would include different
> simplistic behaviors (such as always killing the most memory hogging task,
> killing the most recently started task by the same uid, etc), and leave
> the default heuristic much the same as currently.

OK, agreed. Did you take a look at the set of patches Kame sent today?

Regards,

Vedran

2009-11-02 19:58:51

by Vedran Furač

[permalink] [raw]
Subject: Re: Memory overcommit

David Rientjes wrote:

> On Fri, 30 Oct 2009, Vedran Furac wrote:
>
>>> The problem you identified in http://pastebin.com/f3f9674a0, however, is a
>>> forkbomb issue where the badness score should never have been so high for
>>> kdeinit4 compared to "test". That's directly proportional to adding the
>>> scores of all disjoint child total_vm values into the badness score for
>>> the parent and then killing the children instead.
>> Could you explain to me why ntpd invoked the oom killer? Its parent is
>> init. Or syslog-ng?
>>
>
> Because it attempted an order-0 GFP_USER allocation and direct reclaim
> could not free any pages.
>
> The task that invoked the oom killer is simply the unlucky task that tried
> an allocation that couldn't be satisfied through direct reclaim. It's
> usually unrelated to the task chosen for kill unless
> /proc/sys/vm/oom_kill_allocating_task is enabled (which SGI requested to
> avoid excessively long tasklist scans).

Oh, well, I didn't know that. Maybe rephrasing that part of the output
would help eliminate future misinterpretation.

>> OK then, if you have a solution, I would be glad to test your patch. I
>> won't care much if you don't change total_vm as a baseline. Just make
>> random killing history.
>
> The only randomness is in selecting a task that has a different mm from
> the parent in the order of its child list. Yes, that can be addressed by
> doing a smarter iteration through the children before killing one of them.
>
> Keep in mind that a heuristic as simple as this:
>
> - kill the task that was started most recently by the same uid, or
>
> - kill the task that was started most recently on the system if a root
> task calls the oom killer,
>
> would have yielded perfect results for your testcase but isn't necessarily
> something that we'd ever want to see.

Of course, I want algorithm that works well in all possible situations.

Regards,

Vedran

2009-11-03 20:49:58

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:

> > > - The kernel can't know the program is bad or not. just guess it.
> >
> > Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> > can tell the kernel what we'd like the oom killer behavior should be if
> > the situation arises.
> >
>
> My point is that the server cannot distinguish a memory leak from
> intentional memory usage. Nothing other than that.
>

That's a different point. Today, we can influence the badness score of
any user thread to prioritize oom killing from userspace and that can be
done regardless of whether there's a memory leaker, a fork bomber, etc.
The priority based oom killing is important to production scenarios and
cannot be replaced by a heuristic that works every time if it cannot be
influenced by userspace.

A spike in memory consumption when a process is initially forked would be
defined as a memory leaker in your quiet_time model.

> This summer, at lunch with an everyday linux user, I was told,
> "you, enterprise guys, don't consider desktop or laptop problems at all."
> Yes, I use only servers. My customers use servers, too. My first priority
> is always on server users.
> But, for this time, I wrote a reply to Vedran to try to fix the desktop
> problem. Even if the current logic works well for servers, the "KDE/GNOME
> is killed" problem seems serious. And this may be a problem for EMBEDDED
> people, I guess.
>

You argued before that the problem wasn't specific to X (after I said you
could protect it very trivially with /proc/pid/oom_adj set to
OOM_DISABLE), but that's now your reasoning for rewriting the oom killer
heuristics?

> I can say the same thing about total_vm size. total_vm size doesn't carry
> any good information for the oom situation. And tweaking based on that
> not-useful parameter will make things worse.
>

Tweaking the heuristic will probably make it more convoluted and
overall worse, I agree. But it's a more stable baseline than rss from
which we can set oom killing priorities from userspace.

2009-11-04 00:52:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 3 Nov 2009 12:49:52 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
> > > > - The kernel can't know the program is bad or not. just guess it.
> > >
> > > Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> > > can tell the kernel what we'd like the oom killer behavior should be if
> > > the situation arises.
> > >
> >
> > My point is that the server cannot distinguish a memory leak from
> > intentional memory usage. Nothing other than that.
> >
>
> That's a different point. Today, we can influence the badness score of
> any user thread to prioritize oom killing from userspace and that can be
> done regardless of whether there's a memory leaker, a fork bomber, etc.
> The priority based oom killing is important to production scenarios and
> cannot be replaced by a heuristic that works every time if it cannot be
> influenced by userspace.
>
I didn't remove oom_adj...

> A spike in memory consumption when a process is initially forked would be
> defined as a memory leaker in your quiet_time model.
>
I'll rewrite or drop quiet_time.

> > This summer, at lunch with an everyday linux user, I was told,
> > "you, enterprise guys, don't consider desktop or laptop problems at all."
> > Yes, I use only servers. My customers use servers, too. My first priority
> > is always on server users.
> > But, for this time, I wrote a reply to Vedran to try to fix the desktop
> > problem. Even if the current logic works well for servers, the "KDE/GNOME
> > is killed" problem seems serious. And this may be a problem for EMBEDDED
> > people, I guess.
> >
>
> You argued before that the problem wasn't specific to X (after I said you
> could protect it very trivially with /proc/pid/oom_adj set to
> OOM_DISABLE), but that's now your reasoning for rewriting the oom killer
> heuristics?
>
One of the reasons. My customers always suffer from the "OOM-RANDOM-KILLER".
I mentioned the "lunch" to say that I'm not working _only_ for servers.
OK?


> > I can say the same thing about total_vm size. total_vm size doesn't carry
> > any good information for the oom situation. And tweaking based on that
> > not-useful parameter will make things worse.
> >
>
> Tweaking the heuristic will probably make it more convoluted and
> overall worse, I agree. But it's a more stable baseline than rss from
> which we can set oom killing priorities from userspace.

- "rss < total_vm_size" always.
- oom_adj calculation is quite strong.
- total_vm of processes which map hugetlb is very big... but killing them
is no help for the usual oom.

I recommend you add a "stable baseline" knob for user space, as I wrote.
My patch 6 adds a stable baseline bonus of 50% of vm size if run_time is
large enough.

If users can estimate how their processes use memory, it will be a good
thing. I'll add something other than oom_adj (I don't say I'll drop oom_adj).

Thanks,
-Kame





2009-11-04 01:58:07

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:

> > That's a different point. Today, we can influence the badness score of
> > any user thread to prioritize oom killing from userspace and that can be
> > done regardless of whether there's a memory leaker, a fork bomber, etc.
> > The priority based oom killing is important to production scenarios and
> > cannot be replaced by a heuristic that works every time if it cannot be
> > influenced by userspace.
> >
> I didn't remove oom_adj...
>

Right, but we must ensure that we have the same ability to influence a
priority based oom killing scheme from userspace as we currently do with a
relatively static total_vm. total_vm may not be the optimal baseline, but
it does allow users to tune oom_adj specifically to identify tasks that
are using more memory than expected and to be static enough to not depend
on rss, for example, that is really hard to predict at the time of oom.

That's actually my main goal in this discussion: to avoid losing any
ability of userspace to influence the priority of tasks being oom killed
(if you haven't noticed :).

> > Tweaking the heuristic will probably make it more convoluted and
> > overall worse, I agree. But it's a more stable baseline than rss from
> > which we can set oom killing priorities from userspace.
>
> - "rss < total_vm_size" always.

But rss is much more dynamic than total_vm, that's my point.

> - oom_adj calculation is quite strong.
> - total_vm of processes which map hugetlb is very big... but killing them
> is no help for the usual oom.
>
> I recommend you add a "stable baseline" knob for user space, as I wrote.
> My patch 6 adds a stable baseline bonus of 50% of vm size if run_time is
> large enough.
>

There's no clear relationship between VM size and runtime. The forkbomb
heuristic itself could easily return a badness of ULONG_MAX if one is
detected using runtime and number of children, as I earlier proposed, but
that doesn't seem helpful to factor into the scoring.

2009-11-04 02:19:37

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 3 Nov 2009 17:58:04 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > > That's a different point. Today, we can influence the badness score of
> > > any user thread to prioritize oom killing from userspace and that can be
> > > done regardless of whether there's a memory leaker, a fork bomber, etc.
> > > The priority based oom killing is important to production scenarios and
> > > cannot be replaced by a heuristic that works every time if it cannot be
> > > influenced by userspace.
> > >
> > I didn't remove oom_adj...
> >
>
> Right, but we must ensure that we have the same ability to influence a
> priority based oom killing scheme from userspace as we currently do with a
> relatively static total_vm. total_vm may not be the optimal baseline, but
> it does allow users to tune oom_adj specifically to identify tasks that
> are using more memory than expected, and it is static enough not to depend
> on rss, for example, which is really hard to predict at the time of oom.
>
> That's actually my main goal in this discussion: to avoid losing any
> ability of userspace to influence the priority of tasks being oom killed
> (if you haven't noticed :).
>
> > Tweaking the heuristic will probably make it more convoluted and
> > overall worse, I agree. But it's a more stable baseline than rss from
> > which we can set oom killing priorities from userspace.
> >
> > - "rss < total_vm_size" always.
>
> But rss is much more dynamic than total_vm, that's my point.
>
My point and your point are different.

1. All my concern is "baseline for heuristics"
2. All your concern is "baseline for knob, as oom_adj"

OK? For selecting a victim by the kernel, a dynamic value is much more
useful. The current behavior of "random kill" and "kill multiple processes"
is too bad. Considering what the oom-killer is for, I think "1" is more
important.

But I know what you want, so I offer a new knob which is not affected by
RSS, as I wrote in my previous mail.

Off-topic:
As memcg grows better, using the OOM killer for resource control should
end, I think. Maybe Fake-NUMA+cpuset is working well for the google
system, but please consider using memcg.



> > - oom_adj calculation is quite strong.
> > - total_vm of processes which map hugetlb is very big... but killing them
> > is no help for the usual oom.
> >
> > I recommend you add a "stable baseline" knob for user space, as I wrote.
> > My patch 6 adds a stable baseline bonus of 50% of vm size if run_time is
> > large enough.
> >
>
> There's no clear relationship between VM size and runtime. The forkbomb
> heuristic itself could easily return a badness of ULONG_MAX if one is
> detected using runtime and number of children, as I earlier proposed, but
> that doesn't seem helpful to factor into the scoring.
>

Old processes are important, younger ones are not. But as I wrote, I'll
drop most of patch "6", so please forget about this part.

I'm interested in a fork-bomb killer rather than crazy badness
calculation, now.

Thanks,
-Kame


2009-11-04 03:10:38

by David Rientjes

[permalink] [raw]
Subject: Re: Memory overcommit

On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:

> My point and your point are different.
>
> 1. All my concern is "baseline for heuristics"
> 2. All your concern is "baseline for knob, as oom_adj"
>
> OK? For selecting a victim by the kernel, a dynamic value is much more
> useful. The current behavior of "random kill" and "kill multiple
> processes" is too bad. Considering what the oom-killer is for, I think
> "1" is more important.
>
> But I know what you want, so I offer a new knob which is not affected by
> RSS, as I wrote in my previous mail.
>
> Off-topic:
> As memcg grows better, using the OOM killer for resource control should
> end, I think. Maybe Fake-NUMA+cpuset is working well for the google
> system, but please consider using memcg.
>

I understand what you're trying to do, and I agree with it for most
desktop systems. However, I think that admins should have a very strong
influence in what tasks the oom killer kills. It doesn't really matter if
it's via oom_adj or not, and it's debatable whether an adjustment on a
static heuristic score is in our best interest in the first place. But we
must have an alternative so that our control over oom killing isn't lost.

I'd also like to open another topic for discussion if you're proposing
such sweeping changes: at what point do we allow ~__GFP_NOFAIL allocations
to fail even if order < PAGE_ALLOC_COSTLY_ORDER and defer killing
anything? We both agreed that it's not always in the best interest to
kill a task so that an allocation can succeed, so we need to define some
criteria to simply fail the allocation instead.

> Old processes are important, younger ones are not. But as I wrote, I'll
> drop most of patch "6", so please forget about this part.
>
> I'm interested in a fork-bomb killer rather than crazy badness calculation, now.
>

Ok, great. Thanks.

2009-11-04 03:22:34

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Memory overcommit

On Tue, 3 Nov 2009 19:10:34 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > My point and your point are different.
> >
> > 1. All my concern is "baseline for heuristics"
> > 2. All your concern is "baseline for knob, as oom_adj"
> >
> > OK? For selecting a victim by the kernel, a dynamic value is much more
> > useful. The current behavior of "random kill" and "kill multiple
> > processes" is too bad. Considering what the oom-killer is for, I think
> > "1" is more important.
> >
> > But I know what you want, so I offer a new knob which is not affected by
> > RSS, as I wrote in my previous mail.
> >
> > Off-topic:
> > As memcg grows better, using the OOM killer for resource control should
> > end, I think. Maybe Fake-NUMA+cpuset is working well for the google
> > system, but please consider using memcg.
> >
>
> I understand what you're trying to do, and I agree with it for most
> desktop systems. However, I think that admins should have a very strong
> influence in what tasks the oom killer kills. It doesn't really matter if
> it's via oom_adj or not, and it's debatable whether an adjustment on a
> static heuristic score is in our best interest in the first place. But we
> must have an alternative so that our control over oom killing isn't lost.
>
I won't go too quickly, so let's discuss and rewrite the patches more,
later. I'll prepare a new version next week. This week, I'll post
swap accounting and improve the fork-bomb detector.

> I'd also like to open another topic for discussion if you're proposing
> such sweeping changes: at what point do we allow ~__GFP_NOFAIL allocations
> to fail even if order < PAGE_ALLOC_COSTLY_ORDER and defer killing
> anything? We both agreed that it's not always in the best interest to
> kill a task so that an allocation can succeed, so we need to define some
> criteria to simply fail the allocation instead.
>
Yes, I think the allocation itself (order > 0) should fail more often
before we finally invoke the OOM killer. That tends to be a soft landing
rather than the oom-killer.

Thanks,
-Kame

2009-11-05 19:02:07

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

Hi!

> Agreed it's not obvious. Intuitively I think only including RSS and no
> swap is best, but clearly I can't be entirely against including swap
> too as there may be scenarios where including swap provides for a
> better choice.
>
> My argument for not including swap is that we kill tasks to free RAM
> (we don't really care to free swap, system needs RAM at oom time).

The system should be out of _virtual_ memory at that point, so yes,
freeing swap should help, too.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html