2009-10-28 09:01:16

by Kamezawa Hiroyuki

Subject: [PATCH] oom_kill: use rss value instead of vm size for badness

I may add more tweaks based on this, but this patch should be a good, simple
starting point. This patch is based on mmotm + Kosaki's
http://marc.info/?l=linux-kernel&m=125669809305167&w=2

Test results in various environments are appreciated.

Regards.
-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

It's reported that the OOM-Killer kills Gnome/KDE first...
And yes, we can reproduce it easily.

Now, the oom-killer uses mm->total_vm as its base value. But in recent
applications, there is a big gap between VM size and RSS size, because
- applications attach many dynamic libraries (Gnome, KDE, etc.), and
- applications may allocate a big VM area but use only a small part of it
  (Java and multi-threaded applications have this tendency because of the
  default stack size).

I think using mm->total_vm as the score for oom-kill is not good.
For the same reason, memory overcommit can't work as expected.
(In other words, if we depend on total_vm, being more aggressive with
overcommit is a good choice.)

This patch uses mm->anon_rss/file_rss as the base value for calculating badness.

The following shows the changes to the OOM score (badness) in an environment
with 1.6G of memory plus memory eaters (500M & 1G).

Top 10 badness scores. (The highest one is the first candidate to be killed.)
Before
badness program
91228 gnome-settings-
94210 clock-applet
103202 mixer_applet2
106563 tomboy
112947 gnome-terminal
128944 mmap <----------- 500M malloc
129332 nautilus
215476 bash <----------- parent of 2 mallocs.
256944 mmap <----------- 1G malloc
423586 gnome-session

After
badness
1911 mixer_applet2
1955 clock-applet
1986 xinit
1989 gnome-session
2293 nautilus
2955 gnome-terminal
4113 tomboy
104163 mmap <----------- 500M malloc.
168577 bash <----------- parent of 2 mallocs
232375 mmap <----------- 1G malloc

Seems good to me.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/oom_kill.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
/*
* The memory size of the process is the basis for the badness.
*/
- points = mm->total_vm;
+ points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss);

/*
* After this unlock we can no longer dereference local variable `mm'
@@ -116,8 +116,12 @@ unsigned long badness(struct task_struct
*/
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
+ if (child->mm != mm && child->mm) {
+ unsigned long cpoints;
+ cpoints = get_mm_counter(child->mm, anon_rss);
+ + get_mm_counter(child->mm, file_rss);
+ points += cpoints/2 + 1;
+ }
task_unlock(child);
}


2009-10-28 09:15:49

by David Rientjes

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> It's reported that OOM-Killer kills Gnone/KDE at first...
> And yes, we can reproduce it easily.
>
> Now, oom-killer uses mm->total_vm as its base value. But in recent
> applications, there are a big gap between VM size and RSS size.
> Because
> - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
> - Applications may alloc big VM area but use small part of them.
> (Java, and multi-threaded applications has this tendency because
> of default-size of stack.)
>
> I think using mm->total_vm as score for oom-kill is not good.
> By the same reason, overcommit memory can't work as expected.
> (In other words, if we depends on total_vm, using overcommit more positive
> is a good choice.)
>
> This patch uses mm->anon_rss/file_rss as base value for calculating badness.
>

How does this affect the ability of the user to tune the badness score of
individual threads? It seems like there will now only be two polarizing
options: the equivalent of an oom_adj value of +15 or -17. It is now
heavily dependent on the rss which may be unclear at the time of oom and
very dynamic.

I think a longer-term solution may rely more on the difference between
get_mm_hiwater_rss() and get_mm_rss() instead, to know the difference
between what is resident in RAM at the time of oom and what has
been swapped. Using this with get_mm_hiwater_vm() would produce a nice
picture of the pattern of each task's memory consumption.
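
For illustration, one way this could be read (the formula and the 1/2
weighting below are guesses, not something posted in this thread):

	/*
	 * Sketch only: get_mm_hiwater_rss() is the peak RSS, so the gap
	 * between it and the current RSS approximates what has been
	 * swapped out (or otherwise dropped) since the peak.
	 */
	static unsigned long resident_vs_swapped(struct mm_struct *mm)
	{
		unsigned long resident = get_mm_rss(mm);
		unsigned long swapped = get_mm_hiwater_rss(mm) - resident;

		return resident + swapped / 2;
	}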

> Following is changes to OOM score(badness) on an environment with 1.6G memory
> plus memory-eater(500M & 1G).
>
> Top 10 of badness score. (The highest one is the first candidate to be killed)
> Before
> badness program
> 91228 gnome-settings-
> 94210 clock-applet
> 103202 mixer_applet2
> 106563 tomboy
> 112947 gnome-terminal
> 128944 mmap <----------- 500M malloc
> 129332 nautilus
> 215476 bash <----------- parent of 2 mallocs.
> 256944 mmap <----------- 1G malloc
> 423586 gnome-session
>
> After
> badness
> 1911 mixer_applet2
> 1955 clock-applet
> 1986 xinit
> 1989 gnome-session
> 2293 nautilus
> 2955 gnome-terminal
> 4113 tomboy
> 104163 mmap <----------- 500M malloc.
> 168577 bash <----------- parent of 2 mallocs
> 232375 mmap <----------- 1G malloc
>
> seems good for me.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/oom_kill.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> Index: mm-test-kernel/mm/oom_kill.c
> ===================================================================
> --- mm-test-kernel.orig/mm/oom_kill.c
> +++ mm-test-kernel/mm/oom_kill.c
> @@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
> /*
> * The memory size of the process is the basis for the badness.
> */
> - points = mm->total_vm;
> + points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss);
>
> /*
> * After this unlock we can no longer dereference local variable `mm'
> @@ -116,8 +116,12 @@ unsigned long badness(struct task_struct
> */
> list_for_each_entry(child, &p->children, sibling) {
> task_lock(child);
> - if (child->mm != mm && child->mm)
> - points += child->mm->total_vm/2 + 1;
> + if (child->mm != mm && child->mm) {
> + unsigned long cpoints;
> + cpoints = get_mm_counter(child->mm, anon_rss);
> + + get_mm_counter(child->mm, file_rss);

That shouldn't compile.

> + points += cpoints/2 + 1;
> + }
> task_unlock(child);
> }
>

This can all be simplified by just using get_mm_rss(mm) and
get_mm_rss(child->mm).
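
For reference, as far as I understand the tree being discussed, get_mm_rss()
is simply the sum of those two per-mm counters, so the explicit sums above
collapse into one call:

	#define get_mm_rss(mm) \
		(get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))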

2009-10-28 11:04:57

by Kamezawa Hiroyuki

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

David Rientjes wrote:
> On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
>> From: KAMEZAWA Hiroyuki <[email protected]>
>>
>> It's reported that OOM-Killer kills Gnone/KDE at first...
>> And yes, we can reproduce it easily.
>>
>> Now, oom-killer uses mm->total_vm as its base value. But in recent
>> applications, there are a big gap between VM size and RSS size.
>> Because
>> - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
>> - Applications may alloc big VM area but use small part of them.
>> (Java, and multi-threaded applications has this tendency because
>> of default-size of stack.)
>>
>> I think using mm->total_vm as score for oom-kill is not good.
>> By the same reason, overcommit memory can't work as expected.
>> (In other words, if we depends on total_vm, using overcommit more
>> positive
>> is a good choice.)
>>
>> This patch uses mm->anon_rss/file_rss as base value for calculating
>> badness.
>>
>
> How does this affect the ability of the user to tune the badness score of
> individual threads?
Threads? Processes?

> It seems like there will now only be two polarizing
> options: the equivalent of an oom_adj value of +15 or -17. It is now
> heavily dependent on the rss which may be unclear at the time of oom and
> very dynamic.
>
Yes, and that "dynamic" behavior is a good thing.

I think one of the problems with oom now is that users say "the oom-killer
kills processes at random." And yes, that's correct: mm->total_vm is not
closely related to actual memory usage, so the oom-killer appears to kill
processes at random.

For example, as Vedran shows, even when a memory eater is running, processes
are killed _at random_.

After this patch, the biggest memory user will be the first candidate, which
is reasonable. Users will know "the process was killed because it uses much
memory" (it no longer seems random), and they can decide whether or not to
set oom_adj for the memory eater.



> I think a longer-term solution may rely more on the difference in
> get_mm_hiwater_rss() and get_mm_rss() instead to know the difference
> between what is resident in RAM at the time of oom compared to what has
> been swaped. Using this with get_mm_hiwater_vm() would produce a nice
> picture for the pattern of each task's memory consumption.
>
Hmm, I don't want a complicated calculation (it makes oom_adj usage worse),
but yes, bare rss may be too simple.
Anyway, as I've shown, I'll add swap statistics regardless of this patch.
That may add a new hint. For example:

	if (vm_swap_full())
		points += mm->swap_usage

>> Following is changes to OOM score(badness) on an environment with 1.6G
>> memory
>> plus memory-eater(500M & 1G).
>>
>> Top 10 of badness score. (The highest one is the first candidate to be
>> killed)
>> Before
>> badness program
>> 91228 gnome-settings-
>> 94210 clock-applet
>> 103202 mixer_applet2
>> 106563 tomboy
>> 112947 gnome-terminal
>> 128944 mmap <----------- 500M malloc
>> 129332 nautilus
>> 215476 bash <----------- parent of 2 mallocs.
>> 256944 mmap <----------- 1G malloc
>> 423586 gnome-session
>>
>> After
>> badness
>> 1911 mixer_applet2
>> 1955 clock-applet
>> 1986 xinit
>> 1989 gnome-session
>> 2293 nautilus
>> 2955 gnome-terminal
>> 4113 tomboy
>> 104163 mmap <----------- 500M malloc.
>> 168577 bash <----------- parent of 2 mallocs
>> 232375 mmap <----------- 1G malloc
>>
>> seems good for me.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>> ---
>> mm/oom_kill.c | 10 +++++++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> Index: mm-test-kernel/mm/oom_kill.c
>> ===================================================================
>> --- mm-test-kernel.orig/mm/oom_kill.c
>> +++ mm-test-kernel/mm/oom_kill.c
>> @@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
>> /*
>> * The memory size of the process is the basis for the badness.
>> */
>> - points = mm->total_vm;
>> + points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss);
>>
>> /*
>> * After this unlock we can no longer dereference local variable `mm'
>> @@ -116,8 +116,12 @@ unsigned long badness(struct task_struct
>> */
>> list_for_each_entry(child, &p->children, sibling) {
>> task_lock(child);
>> - if (child->mm != mm && child->mm)
>> - points += child->mm->total_vm/2 + 1;
>> + if (child->mm != mm && child->mm) {
>> + unsigned long cpoints;
>> + cpoints = get_mm_counter(child->mm, anon_rss);
>> + + get_mm_counter(child->mm, file_rss);
>
> That shouldn't compile.
Oh, yes...thanks.

>
>> + points += cpoints/2 + 1;
>> + }
>> task_unlock(child);
>> }
>>
>
> This can all be simplified by just using get_mm_rss(mm) and
> get_mm_rss(child->mm).
>
I will use that.

I'll wait until next week to post a new patch.
There is no need to rush.

Thanks,
-Kame


2009-10-29 01:03:12

by Kamezawa Hiroyuki

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

> I'll wait until the next week to post a new patch.
> We don't need rapid way.
>
I wrote the above... but for my mental health, here is the bug-fixed version.
Sorry for my carelessness. David, thank you for your review.
Regards,
-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

It's reported that the OOM-Killer kills Gnome/KDE first...
And yes, we can reproduce it easily.

Now, the oom-killer uses mm->total_vm as its base value. But in recent
applications, there is a big gap between VM size and RSS size, because
- applications attach many dynamic libraries (Gnome, KDE, etc.), and
- applications may allocate a big VM area but use only a small part of it
  (Java and multi-threaded applications have this tendency because of the
  default stack size).

I think using mm->total_vm as the score for oom-kill is not good.
For the same reason, memory overcommit can't work as expected.
(In other words, if we depend on total_vm, being more aggressive with
overcommit is a good choice.)

This patch uses mm->anon_rss/file_rss as the base value for calculating badness.

The following shows the changes to the OOM score (badness) in an environment
with 1.6G of memory plus memory eaters (500M & 1G).

Top 10 badness scores. (The highest one is the first candidate to be killed.)
Before
badness program
91228 gnome-settings-
94210 clock-applet
103202 mixer_applet2
106563 tomboy
112947 gnome-terminal
128944 mmap <----------- 500M malloc
129332 nautilus
215476 bash <----------- parent of 2 mallocs.
256944 mmap <----------- 1G malloc
423586 gnome-session

After
badness
1911 mixer_applet2
1955 clock-applet
1986 xinit
1989 gnome-session
2293 nautilus
2955 gnome-terminal
4113 tomboy
104163 mmap <----------- 500M malloc.
168577 bash <----------- parent of 2 mallocs
232375 mmap <----------- 1G malloc

Seems good to me. Maybe we can tweak this patch more,
but this one will be a good starting point.

Changelog: 2009/10/29
- use get_mm_rss() instead of get_mm_counter()

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/oom_kill.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
/*
* The memory size of the process is the basis for the badness.
*/
- points = mm->total_vm;
+ points = get_mm_rss(mm);

/*
* After this unlock we can no longer dereference local variable `mm'
@@ -117,7 +117,7 @@ unsigned long badness(struct task_struct
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);
if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
+ points += get_mm_rss(child->mm)/2 + 1;
task_unlock(child);
}


2009-10-29 02:31:44

by Minchan Kim

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, Oct 29, 2009 at 10:00 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>> I'll wait until the next week to post a new patch.
>> We don't need rapid way.
>>
> I wrote above...but for my mental health, this is bug-fixed version.
> Sorry for my carelessness. David, thank you for your review.
> Regards,
> -Kame
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> It's reported that OOM-Killer kills Gnone/KDE at first...
> And yes, we can reproduce it easily.
>
> Now, oom-killer uses mm->total_vm as its base value. But in recent
> applications, there are a big gap between VM size and RSS size.
> Because
> - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
> - Applications may alloc big VM area but use small part of them.
>   (Java, and multi-threaded applications has this tendency because
>    of default-size of stack.)
>
> I think using mm->total_vm as score for oom-kill is not good.
> By the same reason, overcommit memory can't work as expected.
> (In other words, if we depends on total_vm, using overcommit more positive
>  is a good choice.)
>
> This patch uses mm->anon_rss/file_rss as base value for calculating badness.
>
> Following is changes to OOM score(badness) on an environment with 1.6G memory
> plus memory-eater(500M & 1G).
>
> Top 10 of badness score. (The highest one is the first candidate to be killed)
> Before
> badness program
> 91228   gnome-settings-
> 94210   clock-applet
> 103202  mixer_applet2
> 106563  tomboy
> 112947  gnome-terminal
> 128944  mmap              <----------- 500M malloc
> 129332  nautilus
> 215476  bash              <----------- parent of 2 mallocs.
> 256944  mmap              <----------- 1G malloc
> 423586  gnome-session
>
> After
> badness
> 1911    mixer_applet2
> 1955    clock-applet
> 1986    xinit
> 1989    gnome-session
> 2293    nautilus
> 2955    gnome-terminal
> 4113    tomboy
> 104163  mmap             <----------- 500M malloc.
> 168577  bash             <----------- parent of 2 mallocs
> 232375  mmap             <----------- 1G malloc
>
> seems good for me. Maybe we can tweak this patch more,
> but this one will be a good one as a start point.
>
> Changelog: 2009/10/29
> ?- use get_mm_rss() instead of get_mm_counter()
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Let's start from this.

--
Kind regards,
Minchan Kim

2009-10-29 08:32:08

by David Rientjes

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> It's reported that OOM-Killer kills Gnone/KDE at first...
> And yes, we can reproduce it easily.
>
> Now, oom-killer uses mm->total_vm as its base value. But in recent
> applications, there are a big gap between VM size and RSS size.
> Because
> - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
> - Applications may alloc big VM area but use small part of them.
> (Java, and multi-threaded applications has this tendency because
> of default-size of stack.)
>
> I think using mm->total_vm as score for oom-kill is not good.
> By the same reason, overcommit memory can't work as expected.
> (In other words, if we depends on total_vm, using overcommit more positive
> is a good choice.)
>
> This patch uses mm->anon_rss/file_rss as base value for calculating badness.
>
> Following is changes to OOM score(badness) on an environment with 1.6G memory
> plus memory-eater(500M & 1G).
>
> Top 10 of badness score. (The highest one is the first candidate to be killed)
> Before
> badness program
> 91228 gnome-settings-
> 94210 clock-applet
> 103202 mixer_applet2
> 106563 tomboy
> 112947 gnome-terminal
> 128944 mmap <----------- 500M malloc
> 129332 nautilus
> 215476 bash <----------- parent of 2 mallocs.
> 256944 mmap <----------- 1G malloc
> 423586 gnome-session
>
> After
> badness
> 1911 mixer_applet2
> 1955 clock-applet
> 1986 xinit
> 1989 gnome-session
> 2293 nautilus
> 2955 gnome-terminal
> 4113 tomboy
> 104163 mmap <----------- 500M malloc.
> 168577 bash <----------- parent of 2 mallocs
> 232375 mmap <----------- 1G malloc
>
> seems good for me. Maybe we can tweak this patch more,
> but this one will be a good one as a start point.
>

This appears to actually prefer X more than total_vm in Vedran's test
case. He cited http://pastebin.com/f3f9674a0 in
http://marc.info/?l=linux-kernel&m=125678557002888.

There are 12 ooms in this log, which has /proc/sys/vm/oom_dump_tasks
enabled. It shows the difference between the top total_vm candidates vs.
the top rss candidates.

total_vm
708945 test
195695 krunner
168881 plasma-desktop
130567 ktorrent
127081 knotify4
125881 icedove-bin
123036 akregator
118641 kded4

rss
707878 test
42201 Xorg
13300 icedove-bin
10209 ktorrent
9277 akregator
8878 plasma-desktop
7546 krunner
4532 mysqld

This patch would pick the memory hogging task, "test", first every time
just like the current implementation does. It would then prefer Xorg,
icedove-bin, and ktorrent next as a starting point.

Admittedly, there are other heuristics that the oom killer uses to create
a badness score. But since this patch is only changing the baseline from
mm->total_vm to get_mm_rss(mm), its behavior in this test case does not
match the patch description.

The vast majority of the other ooms have identical top 8 candidates:

total_vm
673222 test
195695 krunner
168881 plasma-desktop
130567 ktorrent
127081 knotify4
125881 icedove-bin
123036 akregator
121869 firefox-bin

rss
672271 test
42192 Xorg
30763 firefox-bin
13292 icedove-bin
10208 ktorrent
9260 akregator
8859 plasma-desktop
7528 krunner

firefox-bin seems much more preferred in this case than total_vm, but Xorg
still ranks very high with this patch compared to the current
implementation.

2009-10-29 08:49:06

by Kamezawa Hiroyuki

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, 29 Oct 2009 01:31:59 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > It's reported that OOM-Killer kills Gnone/KDE at first...
> > And yes, we can reproduce it easily.
> >
> > Now, oom-killer uses mm->total_vm as its base value. But in recent
> > applications, there are a big gap between VM size and RSS size.
> > Because
> > - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
> > - Applications may alloc big VM area but use small part of them.
> > (Java, and multi-threaded applications has this tendency because
> > of default-size of stack.)
> >
> > I think using mm->total_vm as score for oom-kill is not good.
> > By the same reason, overcommit memory can't work as expected.
> > (In other words, if we depends on total_vm, using overcommit more positive
> > is a good choice.)
> >
> > This patch uses mm->anon_rss/file_rss as base value for calculating badness.
> >
> > Following is changes to OOM score(badness) on an environment with 1.6G memory
> > plus memory-eater(500M & 1G).
> >
> > Top 10 of badness score. (The highest one is the first candidate to be killed)
> > Before
> > badness program
> > 91228 gnome-settings-
> > 94210 clock-applet
> > 103202 mixer_applet2
> > 106563 tomboy
> > 112947 gnome-terminal
> > 128944 mmap <----------- 500M malloc
> > 129332 nautilus
> > 215476 bash <----------- parent of 2 mallocs.
> > 256944 mmap <----------- 1G malloc
> > 423586 gnome-session
> >
> > After
> > badness
> > 1911 mixer_applet2
> > 1955 clock-applet
> > 1986 xinit
> > 1989 gnome-session
> > 2293 nautilus
> > 2955 gnome-terminal
> > 4113 tomboy
> > 104163 mmap <----------- 500M malloc.
> > 168577 bash <----------- parent of 2 mallocs
> > 232375 mmap <----------- 1G malloc
> >
> > seems good for me. Maybe we can tweak this patch more,
> > but this one will be a good one as a start point.
> >
>
> This appears to actually prefer X more than total_vm in Vedran's test
> case. He cited http://pastebin.com/f3f9674a0 in
> http://marc.info/?l=linux-kernel&m=125678557002888.
>
> There are 12 ooms in this log, which has /proc/sys/vm/oom_dump_tasks
> enabled. It shows the difference between the top total_vm candidates vs.
> the top rss candidates.
>
> total_vm
> 708945 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 118641 kded4
>
> rss
> 707878 test
> 42201 Xorg
> 13300 icedove-bin
> 10209 ktorrent
> 9277 akregator
> 8878 plasma-desktop
> 7546 krunner
> 4532 mysqld
>
> This patch would pick the memory hogging task, "test", first everytime
> just like the current implementation does. It would then prefer Xorg,
> icedove-bin, and ktorrent next as a starting point.
>
> Admittedly, there are other heuristics that the oom killer uses to create
> a badness score. But since this patch is only changing the baseline from
> mm->total_vm to get_mm_rss(mm), its behavior in this test case do not
> match the patch description.
>
Yes, that's why I wrote "as a starting point". There are many environments.
But I'm not sure why ntpd can be the first candidate...
The scores you showed don't include the children's scores, right?

I believe I'll have to remove "adding the child's score to the parent".
I'm now considering how to implement a fork-bomb detector so that it can be removed.

> The vast majority of the other ooms have identical top 8 candidates:
>
> total_vm
> 673222 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 121869 firefox-bin
>
> rss
> 672271 test
> 42192 Xorg
> 30763 firefox-bin
> 13292 icedove-bin
> 10208 ktorrent
> 9260 akregator
> 8859 plasma-desktop
> 7528 krunner
>
> firefox-bin seems much more preferred in this case than total_vm, but Xorg
> still ranks very high with this patch compared to the current
> implementation.
>
Yeah, I'm now considering dropping file_rss from the calculation.

Some reasons:

- file caches still in memory at OOM tend to be ones that are troublesome to reclaim.
- file caches tend to be shared.
- if the file caches are from shmem, we can never drop them when there is no swap or swap is full.

Maybe we'll get a better result.

Regards,
-Kame



2009-10-29 09:01:52

by David Rientjes

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> > This appears to actually prefer X more than total_vm in Vedran's test
> > case. He cited http://pastebin.com/f3f9674a0 in
> > http://marc.info/?l=linux-kernel&m=125678557002888.
> >
> > There are 12 ooms in this log, which has /proc/sys/vm/oom_dump_tasks
> > enabled. It shows the difference between the top total_vm candidates vs.
> > the top rss candidates.
> >
> > total_vm
> > 708945 test
> > 195695 krunner
> > 168881 plasma-desktop
> > 130567 ktorrent
> > 127081 knotify4
> > 125881 icedove-bin
> > 123036 akregator
> > 118641 kded4
> >
> > rss
> > 707878 test
> > 42201 Xorg
> > 13300 icedove-bin
> > 10209 ktorrent
> > 9277 akregator
> > 8878 plasma-desktop
> > 7546 krunner
> > 4532 mysqld
> >
> > This patch would pick the memory hogging task, "test", first everytime
> > just like the current implementation does. It would then prefer Xorg,
> > icedove-bin, and ktorrent next as a starting point.
> >
> > Admittedly, there are other heuristics that the oom killer uses to create
> > a badness score. But since this patch is only changing the baseline from
> > mm->total_vm to get_mm_rss(mm), its behavior in this test case do not
> > match the patch description.
> >
> yes, then I wrote "as start point". There are many environments.

And this environment has a particularly bad result.

> But I'm not sure why ntpd can be the first candidate...
> The scores you shown doesn't include children's score, right ?
>

Right, it's just the get_mm_rss(mm) for each thread shown in the oom dump,
the same value you've used as the new baseline. The actual badness scores
could easily be calculated by cat'ing /proc/*/oom_score prior to oom, but
this data was meant to illustrate the preference given to rss compared to
total_vm in a heuristic sense.

> I believe I'll have to remove "adding child's score to parents".
> I'm now considering how to implement fork-bomb detector for removing it.
>

Agreed, I'm looking forward to your proposal.

> ya, I'm now considering to drop file_rss from calculation.
>
> some reasons.
>
> - file caches remaining in memory at OOM tend to have some trouble to remove it.
> - file caches tend to be shared.
> - if file caches are from shmem, we never be able to drop them if no swap/swapfull.
>
> Maybe we'll have better result.
>

That sounds more appropriate.

I'm surprised you still don't see a value in using the peak VM and RSS
sizes, though, as part of your formula as it would indicate the proportion
of memory resident in RAM at the time of oom.

2009-10-29 09:19:19

by Kamezawa Hiroyuki

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, 29 Oct 2009 02:01:49 -0700 (PDT)
David Rientjes <[email protected]> wrote:
> > yes, then I wrote "as start point". There are many environments.
>
> And this environment has a particularly bad result.
> yes, then I wrote "as start point". There are many environments.

In my understanding, the 2nd and 3rd candidates are not important. If both
total_vm and RSS catch the same process as the 1st candidate, it's OK.
(i.e. if it is killed, the oom situation will go away.)


> > ya, I'm now considering to drop file_rss from calculation.
> >
> > some reasons.
> >
> > - file caches remaining in memory at OOM tend to have some trouble to remove it.
> > - file caches tend to be shared.
> > - if file caches are from shmem, we never be able to drop them if no swap/swapfull.
> >
> > Maybe we'll have better result.
> >
>
> That sounds more appropriate.
>
> I'm surprised you still don't see a value in using the peak VM and RSS
> sizes, though, as part of your formula as it would indicate the proportion
> of memory resident in RAM at the time of oom.
>
I'll use swap_usage instead of the peak VM size as a bonus.

anon_rss + swap_usage/2, or something like that.

My first goal is not to kill innocent processes at random.
If the memory eater is killed, that's reasonable.

In my consideration:

- "Killing a process because of OOM" is something bad, but not avoidable.

- We don't need a complicated or too-clever calculation for choosing the
  process to kill. "The worst one is the memory eater!" is easy for users
  and admins to understand.

- We have oom_adj now. The user can customize it if he runs an _important_
  memory eater.

- But a fork-bomb doesn't look like a memory eater if we look at each
  process individually. We need some care there.

Then,
- I'd like to drop file_rss.
- I'd like to take swap_usage into account.
- I'd like to remove the cpu_time bonus. The runtime bonus is much more important.
- I'd like to remove the penalty from children. To do that, a fork-bomb detector
  is necessary.
- The nice bonus is bad. (We have oom_adj instead of this.) It should be
	if (task_nice(p) < 0)
		points /= 2;
  but we have the "root user" bonus already, so we can remove this line.

After the above, a much simpler, easier-to-understand selection will be done.
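
For illustration, a minimal sketch of what that simplified baseline might
look like (mm->swap_usage is the hypothetical per-mm swap counter mentioned
above, which does not exist in the tree being discussed, and the /2 weighting
is only the guess from earlier in this mail):

	static unsigned long simple_badness(struct mm_struct *mm)
	{
		/* anonymous memory is what killing the task will actually free */
		unsigned long points = get_mm_counter(mm, anon_rss);

		/* hypothetical swap bonus, counted only when swap is nearly full */
		if (vm_swap_full())
			points += mm->swap_usage / 2;

		return points;
	}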

Thanks,
-Kame

2009-10-29 09:44:50

by David Rientjes

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> > > yes, then I wrote "as start point". There are many environments.
> >
> > And this environment has a particularly bad result.
> > yes, then I wrote "as start point". There are many environments.
>
> In my understanding, 2nd, 3rd candidates are not important. If both of
> total_vm and RSS catches the same process as 1st candidate, it's ok.
> (i.e. If killed, oom situation will go away.)
>

The ordering would matter on a machine with smaller capacity or if Vedran
was using mem=, theoretically at the size of its current capacity minus
the amount of anonymous memory being mlocked by the "test" program. When
the oom occurs (and notice it's not triggered by "test" each time), it
would have killed Xorg in what would otherwise be the same conditions.

> > I'm surprised you still don't see a value in using the peak VM and RSS
> > sizes, though, as part of your formula as it would indicate the proportion
> > of memory resident in RAM at the time of oom.
> >
> I'll use swap_usage instead of peak VM size as bonus.
>
> anon_rss + swap_usage/2 ? or some.
>
> My first purpose is not to kill not-guilty process at random.
> If memory eater is killed, it's reasnoable.
>

We again arrive at the distinction I made earlier where there are two
approaches: kill a task that is consuming the majority of resident RAM, or
kill a thread that is using much more memory than expected, such as a
memory leaker. I know that you've argued that the kernel can never know
the latter, and I agree, but the latter does have the benefit of allowing the
user to have more input and determine when an actual task is using much
more RAM than expected; the anon_rss and swap_usage in your formula are
highly dynamic, so you'd have to expect the user to dynamically alter
oom_adj to specify a preference in the case of the memory leaker.

> In my consideration
>
> - "Killing a process because of OOM" is something bad, but not avoidable.
>
> - We don't need to do compliated/too-wise calculation for killing a process.
> "The worst one is memory-eater!" is easy to understand to users and admins.
>

Is this a proposal to remove the remainder of the heuristics as well, such
as considering superuser tasks and those with longer uptimes? I'd agree
with removing most of it other than the oom_adj and current->mems_allowed
intersection penalty. We're probably going to need to rewrite the badness
heuristic from scratch instead of simply changing the baseline.

> - We have oom_adj, now. User can customize it if he run _important_ memory eater.
>

If he runs an important memory eater, he can always polarize it by
disabling oom killing completely for that task. However, oom_adj is also
used to identify memory leakers when the amount of memory that it uses is
roughly known. Most people don't know how much memory their applications
use, but there are systems where users have tuned oom_adj specifically
based on comparative /proc/pid/oom_score results. Simply using anon_rss
and swap_usage will make that vary much more than previously.

> - But fork-bomb doesn't seem memory eater if we see each process.
> We need some cares.
>

The forkbomb can be addressed in multiple ways, the simplest of which
is simply counting the number of children and their runtime. It'd
probably even be better to isolate the forkbomb case away from the badness
score and simply kill the parent by returning ULONG_MAX when it's
recognized.
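
For illustration, a rough sketch of that simplest form, counting children
only (FORKBOMB_CHILD_LIMIT is a made-up threshold, and a real detector
would also weigh each child's runtime as suggested above):

	#define FORKBOMB_CHILD_LIMIT	64

	/* sketch only, not a posted patch */
	static int looks_like_forkbomb(struct task_struct *p, struct mm_struct *mm)
	{
		struct task_struct *child;
		unsigned int nr = 0;

		list_for_each_entry(child, &p->children, sibling) {
			task_lock(child);
			if (child->mm && child->mm != mm)
				nr++;
			task_unlock(child);
		}
		return nr > FORKBOMB_CHILD_LIMIT;
	}

	/* in badness():
	 *	if (looks_like_forkbomb(p, mm))
	 *		return ULONG_MAX;
	 */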

> Then,
> - I'd like to drop file_rss.
> - I'd like to take swap_usage into acccount.
> - I'd like to remove cpu_time bonus. runtime bonus is much more important.
> - I'd like to remove penalty from children. To do that, fork-bomb detector
> is necessary.
> - nice bonus is bad. (We have oom_adj instead of this.) It should be
> if (task_nice(p) < 0)
> points /= 2;
> But we have "root user" bonus already. We can remove this line.
>
> After above, much more simple selection, easy-to-understand, will be done.
>

Agreed, I think we'll need to rewrite most of the heuristic from scratch.

2009-10-29 23:44:11

by Kamezawa Hiroyuki

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, 29 Oct 2009 02:44:45 -0700 (PDT)
David Rientjes <[email protected]> wrote:
> > Then,
> > - I'd like to drop file_rss.
> > - I'd like to take swap_usage into acccount.
> > - I'd like to remove cpu_time bonus. runtime bonus is much more important.
> > - I'd like to remove penalty from children. To do that, fork-bomb detector
> > is necessary.
> > - nice bonus is bad. (We have oom_adj instead of this.) It should be
> > if (task_nice(p) < 0)
> > points /= 2;
> > But we have "root user" bonus already. We can remove this line.
> >
> > After above, much more simple selection, easy-to-understand, will be done.
> >
>
> Agreed, I think we'll need to rewrite most of the heuristic from scratch.

I'd like to post a total redesign of the oom-killer next week.
Please wait.

Thanks,
-Kame

2009-11-01 13:30:15

by KOSAKI Motohiro

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

> This patch would pick the memory hogging task, "test", first everytime
> just like the current implementation does. It would then prefer Xorg,
> icedove-bin, and ktorrent next as a starting point.
>
> Admittedly, there are other heuristics that the oom killer uses to create
> a badness score. But since this patch is only changing the baseline from
> mm->total_vm to get_mm_rss(mm), its behavior in this test case do not
> match the patch description.
>
> The vast majority of the other ooms have identical top 8 candidates:
>
> total_vm
> 673222 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 121869 firefox-bin
>
> rss
> 672271 test
> 42192 Xorg
> 30763 firefox-bin
> 13292 icedove-bin
> 10208 ktorrent
> 9260 akregator
> 8859 plasma-desktop
> 7528 krunner
>
> firefox-bin seems much more preferred in this case than total_vm, but Xorg
> still ranks very high with this patch compared to the current
> implementation.

Hi David,

I'm very interested in your observation. Thanks for the good testing.
So, I'd like to clarify your point a bit.

The following is the badness list on my desktop environment (x86_64, 6GB mem).
It shows Xorg has a pretty small badness score. Do you know why such a
difference happens?


score pid comm
==============================
56382 3241 run-mozilla.sh
23345 3289 run-mozilla.sh
21461 3050 gnome-do
20079 2867 gnome-session
14016 3258 firefox
9212 3306 firefox
8468 3115 gnome-do
6902 3325 emacs
6783 3212 tomboy
4865 2968 python
4861 2948 nautilus
4221 1 init
(snip about 100line)
548 2590 Xorg

2009-11-02 10:42:08

by David Rientjes

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Sun, 1 Nov 2009, KOSAKI Motohiro wrote:

> > total_vm
> > 673222 test
> > 195695 krunner
> > 168881 plasma-desktop
> > 130567 ktorrent
> > 127081 knotify4
> > 125881 icedove-bin
> > 123036 akregator
> > 121869 firefox-bin
> >
> > rss
> > 672271 test
> > 42192 Xorg
> > 30763 firefox-bin
> > 13292 icedove-bin
> > 10208 ktorrent
> > 9260 akregator
> > 8859 plasma-desktop
> > 7528 krunner
> >
> > firefox-bin seems much more preferred in this case than total_vm, but Xorg
> > still ranks very high with this patch compared to the current
> > implementation.
>
> Hi David,
>
> I'm very interesting your pointing out. thanks good testing.
> So, I'd like to clarify your point a bit.
>
> following are badness list on my desktop environment (x86_64 6GB mem).
> it show Xorg have pretty small badness score. Do you know why such
> different happen?
>

I don't know specifically what's different on your machine than Vedran's,
my data is simply a collection of the /proc/sys/vm/oom_dump_tasks output
from Vedran's oom log.

I guess we could add a call to badness() for the oom_dump_tasks tasklist
dump to get a clearer picture so we know the score for each thread group
leader. Anything else would be speculation at this point, though.
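
For illustration, a rough sketch of what such a call might look like (not a
posted patch; the badness() signature is the one visible in the hunks above,
and the uptime handling is an assumption about this tree):

	static void dump_task_score(struct task_struct *p)
	{
		struct timespec uptime;

		do_posix_clock_monotonic_gettime(&uptime);
		printk(KERN_INFO "pid %5d score %8lu comm %s\n",
		       task_pid_nr(p), badness(p, uptime.tv_sec), p->comm);
	}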

> score pid comm
> ==============================
> 56382 3241 run-mozilla.sh
> 23345 3289 run-mozilla.sh
> 21461 3050 gnome-do
> 20079 2867 gnome-session
> 14016 3258 firefox
> 9212 3306 firefox
> 8468 3115 gnome-do
> 6902 3325 emacs
> 6783 3212 tomboy
> 4865 2968 python
> 4861 2948 nautilus
> 4221 1 init
> (snip about 100line)
> 548 2590 Xorg
>

Are these scores with your rss patch or without? If it's without the
patch, this is understandable since Xorg didn't appear highly in Vedran's
log either.

2009-11-02 12:35:56

by KOSAKI Motohiro

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

>> Hi David,
>>
>> I'm very interesting your pointing out. thanks good testing.
>> So, I'd like to clarify your point a bit.
>>
>> following are badness list on my desktop environment (x86_64 6GB mem).
>> it show Xorg have pretty small badness score. Do you know why such
>> different happen?
>
> I don't know specifically what's different on your machine than Vedran's,
> my data is simply a collection of the /proc/sys/vm/oom_dump_tasks output
> from Vedran's oom log.
>
> I guess we could add a call to badness() for the oom_dump_tasks tasklist
> dump to get a clearer picture so we know the score for each thread group
> leader. Anything else would be speculation at this point, though.
>
>> score pid comm
>> ==============================
>> 56382 3241 run-mozilla.sh
>> 23345 3289 run-mozilla.sh
>> 21461 3050 gnome-do
>> 20079 2867 gnome-session
>> 14016 3258 firefox
>> 9212 3306 firefox
>> 8468 3115 gnome-do
>> 6902 3325 emacs
>> 6783 3212 tomboy
>> 4865 2968 python
>> 4861 2948 nautilus
>> 4221 1 init
>> (snip about 100line)
>> 548 2590 Xorg
>>
>
> Are these scores with your rss patch or without? If it's without the
> patch, this is understandable since Xorg didn't appear highly in Vedran's
> log either.

Oh, I'm sorry. I measured with the rss patch.
Then I don't understand what gives Xorg a bad score.

Hmm...
Vedran, can you please post the result of the following command?

# cat /proc/`pidof Xorg`/smaps


I hope to understand the issue clearly before modifying any code.

2009-11-02 19:55:55

by Vedran Furač

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

KOSAKI Motohiro wrote:

> Oh, I'm sorry. I mesured with rss patch.
> Then, I haven't understand what makes Xorg bad score.
>
> Hmm...
> Vedran, Can you please post following command result?
>
> # cat /proc/`pidof Xorg`/smaps
>
> I hope to undestand the issue clearly before modify any code.


No problem:

http://pastebin.com/d66972025 (long)

Xorg is from debian unstable.

Regards,

Vedran

2009-11-03 23:09:11

by KOSAKI Motohiro

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

> KOSAKI Motohiro wrote:
>
> > Oh, I'm sorry. I mesured with rss patch.
> > Then, I haven't understand what makes Xorg bad score.
> >
> > Hmm...
> > Vedran, Can you please post following command result?
> >
> > # cat /proc/`pidof Xorg`/smaps
> >
> > I hope to undestand the issue clearly before modify any code.
>
> No problem:
>
> http://pastebin.com/d66972025 (long)
>
> Xorg is from debian unstable.

Hmm...

Your Xorg has a pretty large heap. I'm not sure why that happens.
(An ATI video card issue?)
Unfortunately, it shows up as a normal large heap to the kernel, so
I doubt the kernel can distinguish X from other processes. Probably
oom_adj is the most reasonable option....

-------------------------------------------------
[heap]
Size: 433812 kB
Rss: 433304 kB
Pss: 433304 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 280 kB
Private_Dirty: 433024 kB
Referenced: 415656 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB



2009-11-07 19:22:01

by Vedran Furač

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

KOSAKI Motohiro wrote:

> Your Xorg have pretty large heap. I'm not sure why it happen. (ATI
> video card issue?)

It is ATI (fglrx), but I don't know whether it is a driver issue or not. I
have a lot of apps running, firefox with a high number of tabs, and so on.
It probably all adds up.

> Unfortunatelly, It is showed as normal large heap from kernel. then,
> I doubt kernel can distinguish X from other process. Probably oom-adj
> is most reasonable option....
>
> -------------------------------------------------
> [heap]
> Size: 433812 kB
> Rss: 433304 kB
> Pss: 433304 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 280 kB
> Private_Dirty: 433024 kB
> Referenced: 415656 kB
> Swap: 0 kB
> KernelPageSize: 4 kB
> MMUPageSize: 4 kB


2009-11-25 12:45:18

by Andrea Arcangeli

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

Hello,

This has been a lengthy discussion about something I think is quite obviously
better, and which I already tried to change a couple of years back
(rss instead of total_vm).

On Thu, Oct 29, 2009 at 01:31:59AM -0700, David Rientjes wrote:
> total_vm
> 708945 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 118641 kded4
>
> rss
> 707878 test
> 42201 Xorg
> 13300 icedove-bin
> 10209 ktorrent
> 9277 akregator
> 8878 plasma-desktop
> 7546 krunner
> 4532 mysqld
>
> This patch would pick the memory hogging task, "test", first everytime

That is by far the only thing that matters. There's plenty of logic in
the oom killer to remove races with tasks with TIF_MEMDIE set, and to
ensure we don't fall through to a second task until the first task has had
time to release all its memory back to the system.

> just like the current implementation does. It would then prefer Xorg,

You're focusing on the noise and not looking at the only thing that
matters.

The noise level with rss went down to around 50000, and the order of what's
below that doesn't matter. The only thing that matters is the _delta_
between the "noise-level innocent apps" and the "exploit".

The delta clearly increases from 708945-max(noise) to
707878-max(noise), which translates to an increase in precision from
513250 (708945 - 195695) to 665677 (707878 - 42201), and shows how much more
accurate rss makes the detection (i.e. the distance between the exploit and
the first innocent app). The lower the level at which the noise starts, the
less likely it is that innocent apps are killed.

There's simply no way to get to perfection, some innocent apps will
always have high total_vm or rss levels, but this at least removes
lots of innocent apps from the equation. The fact X isn't less
innocent than before is because its rss is quite big, and this is not
an error, luckily much smaller than the hog itself. Surely there are
ways to force X to load huge bitmaps into its address space too
(regardless of total_vm or rss) but again no perfection, just better
with rss even in this testcase.

2009-11-25 21:40:11

by David Rientjes

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Wed, 25 Nov 2009, Andrea Arcangeli wrote:

> You're focusing on the noise and not looking at the only thing that
> matters.
>
> The noise level with rss went down to 50000, it doesn't matter the
> order of what's below 50000. Only thing it matters is the _delta_
> between "noise-level innocent apps" and "exploit".
>
> The delta is clearly increase from 708945-max(noise) to
> 707878-max(noise) which translates to a increase of precision from
> 513250 to 665677, which shows how much more rss is making the
> detection more accurate (i.e. the distance between exploit and first
> innocent app). The lower level the noise level starts, the less likely
> the innocent apps are killed.
>

That's not surprising since the amount of physical RAM is the constraining
factor.

> There's simply no way to get to perfection, some innocent apps will
> always have high total_vm or rss levels, but this at least removes
> lots of innocent apps from the equation. The fact X isn't less
> innocent than before is because its rss is quite big, and this is not
> an error, luckily much smaller than the hog itself. Surely there are
> ways to force X to load huge bitmaps into its address space too
> (regardless of total_vm or rss) but again no perfection, just better
> with rss even in this testcase.
>

We use the oom killer as a mechanism to enforce memory containment policy,
we are much more interested in the oom killing priority than the oom
killer's own heuristics to determine the ideal task to kill. Those
heuristics can't possibly represent the priorities for all possible
workloads, so we require input from the user via /proc/pid/oom_adj to
adjust that heuristic. That has traditionally always used total_vm as a
baseline which is a much more static value and can be quantified within a
reasonable range by experimental data when it would not be defined as
rogue. By changing the baseline to rss, we lose much of that control
since it's more dynamic and dependent on the current state of the machine
at the time of the oom, which can be predicted with less accuracy.

2009-11-26 00:10:12

by Vedran Furač

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

Andrea Arcangeli wrote:

> Hello,

Hi all!

> lengthy discussion on something I think is quite obviously better and
> I tried to change a couple of years back already (rss instead of
> total_vm).

Now that 2.6.32 is almost out, is it possible to get OOMK fixed in
2.6.33 so that I could turn overcommit on (overcommit_memory=0) again
without fear of losing my work?

Regards,

Vedran


--
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

2009-11-26 01:35:20

by Kamezawa Hiroyuki

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, 26 Nov 2009 01:10:12 +0100
Vedran Furač <[email protected]> wrote:

> Andrea Arcangeli wrote:
>
> > Hello,
>
> Hi all!
>
> > lengthy discussion on something I think is quite obviously better and
> > I tried to change a couple of years back already (rss instead of
> > total_vm).
>
> Now that 2.6.32 is almost out, is it possible to get OOMK fixed in
> 2.6.33 so that I could turn overcommit on (overcommit_memory=0) again
> without fear of loosing my work?
>
I'll try the fork-bomb detector again. That will finally help your X.org,
but it may miss 2.6.33.

Adding a new counter to mm_struct has been rejected because of scalability
concerns, so the total work will need more time (than expected).
I'm sorry I can't get enough time in these weeks.

Thanks,
-Kame

2009-11-27 01:56:43

by Vedran Furač

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

KAMEZAWA Hiroyuki wrote:

> On Thu, 26 Nov 2009 01:10:12 +0100
> Vedran Furač <[email protected]> wrote:
>
>> Andrea Arcangeli wrote:
>>
>>> lengthy discussion on something I think is quite obviously better and
>>> I tried to change a couple of years back already (rss instead of
>>> total_vm).
>> Now that 2.6.32 is almost out, is it possible to get OOMK fixed in
>> 2.6.33 so that I could turn overcommit on (overcommit_memory=0) again
>> without fear of loosing my work?
>>
> I'll try fork-bomb detector again. That will finally help your X.org.
> But It may lose 2.6.33.
>
> Adding new counter to mm_struct is now rejected because of scalability, so
> total work will need more time (than expected).
> I'm sorry I can't get enough time in these weeks.

Thanks for working on this! Hope it gets into 33. Keep me posted.

Regards,

Vedran


--
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12

2009-11-27 18:26:41

by Andrea Arcangeli

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Wed, Nov 25, 2009 at 01:39:59PM -0800, David Rientjes wrote:
> adjust that heuristic. That has traditionally always used total_vm as a
> baseline which is a much more static value and can be quantified within a
> reasonable range by experimental data when it would not be defined as
> rogue. By changing the baseline to rss, we lose much of that control
> since its more dynamic and dependent on the current state of the machine
> at the time of the oom which can be predicted with less accuracy.

Ok I can see the fact by being dynamic and less predictable worries
you. The "second to last" tasks especially are going to be less
predictable, but the memory hog would normally end up accounting for
most of the memory and this should increase the badness delta between
the offending tasks (or tasks) and the innocent stuff, so making it
more reliable. The innocent stuff should be more and more paged out
from ram. So I tend to think it'll be much less likely to kill an
innocent task this way (as demonstrated in practice by your
measurement too), but it's true there's no guarantee it'll always do
the right thing, because it's a heuristic anyway, but even total_vm
doesn't provide guarantee unless your workload is stationary and your
badness scores are fixed and no virtual memory is ever allocated by
any task in the system and no new task are spawned.

It'd help if you posted a regression showing a smaller delta between the
oom-target task and the second task. My email was just to point out that your
measurement was a good thing in oom-killing terms. If I have to imagine
the worst case for this, it's an app allocating memory at a very slow
pace, then slowly getting swapped out and accumulating a huge swap
size. Maybe we need to add swap size to rss, dunno, but the paged-out
MAP_SHARED equivalent can't be accounted the way we can account swap
size, so in practice I feel a raw rss is going to be more practical
than making swap special vs. file mappings.

2009-11-30 23:09:47

by David Rientjes

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Fri, 27 Nov 2009, Andrea Arcangeli wrote:

> Ok I can see the fact by being dynamic and less predictable worries
> you. The "second to last" tasks especially are going to be less
> predictable, but the memory hog would normally end up accounting for
> most of the memory and this should increase the badness delta between
> the offending tasks (or tasks) and the innocent stuff, so making it
> more reliable. The innocent stuff should be more and more paged out
> from ram. So I tend to think it'll be much less likely to kill an
> innocent task this way (as demonstrated in practice by your
> measurement too), but it's true there's no guarantee it'll always do
> the right thing, because it's a heuristic anyway, but even total_vm
> doesn't provide guarantee unless your workload is stationary and your
> badness scores are fixed and no virtual memory is ever allocated by
> any task in the system and no new task are spawned.
>

The purpose of /proc/pid/oom_adj is not always to polarize the heuristic
for the task it represents, it allows userspace to define when a task is
rogue. Working with total_vm as a baseline, it is simple to use the
interface to tune the heuristic to prefer a certain task over another when
its memory consumption goes beyond what is expected. With this interface,
I can easily define when an application should be oom killed because it is
using far more memory than expected. I can also disable oom killing
completely for it, if necessary. Unless you have a consistent baseline
for all tasks, the adjustment wouldn't contextually make any sense. Using
rss does not allow users to statically define when a task is rogue and is
dependent on the current state of memory at the time of oom.

I would support removing most of the other heuristics other than the
baseline and the nodes intersection with mems_allowed to prefer tasks in
the same cpuset, though, to make it easier to understand and tune.

2009-12-01 04:43:33

by KOSAKI Motohiro

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

> On Fri, 27 Nov 2009, Andrea Arcangeli wrote:
>
> > Ok I can see the fact by being dynamic and less predictable worries
> > you. The "second to last" tasks especially are going to be less
> > predictable, but the memory hog would normally end up accounting for
> > most of the memory and this should increase the badness delta between
> > the offending tasks (or tasks) and the innocent stuff, so making it
> > more reliable. The innocent stuff should be more and more paged out
> > from ram. So I tend to think it'll be much less likely to kill an
> > innocent task this way (as demonstrated in practice by your
> > measurement too), but it's true there's no guarantee it'll always do
> > the right thing, because it's a heuristic anyway, but even total_vm
> > doesn't provide guarantee unless your workload is stationary and your
> > badness scores are fixed and no virtual memory is ever allocated by
> > any task in the system and no new task are spawned.
> >
>
> The purpose of /proc/pid/oom_adj is not always to polarize the heuristic
> for the task it represents, it allows userspace to define when a task is
> rogue. Working with total_vm as a baseline, it is simple to use the
> interface to tune the heuristic to prefer a certain task over another when
> its memory consumption goes beyond what is expected. With this interface,
> I can easily define when an application should be oom killed because it is
> using far more memory than expected. I can also disable oom killing
> completely for it, if necessary. Unless you have a consistent baseline
> for all tasks, the adjustment wouldn't contextually make any sense. Using
> rss does not allow users to statically define when a task is rogue and is
> dependent on the current state of memory at the time of oom.
>
> I would support removing most of the other heuristics other than the
> baseline and the nodes intersection with mems_allowed to prefer tasks in
> the same cpuset, though, to make it easier to understand and tune.

I feel what you're describing means oom_adj doesn't fit your use case. You
probably need a new /proc/{pid}/oom_priority knob; the oom adjustment doesn't
fit you. You need a job-severity-based oom killing order, and severity doesn't
depend on any heuristic.
The server administrator should know the job severity on his system.

The OOM heuristic should mainly consider desktop usage, because desktop users
don't change the oom knob at all, and they don't know which daemon is important.
Any useful heuristic has some dynamic aspect; we can't avoid it.

Thoughts?

2009-12-01 22:20:22

by David Rientjes

Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Tue, 1 Dec 2009, KOSAKI Motohiro wrote:

> > The purpose of /proc/pid/oom_adj is not always to polarize the heuristic
> > for the task it represents, it allows userspace to define when a task is
> > rogue. Working with total_vm as a baseline, it is simple to use the
> > interface to tune the heuristic to prefer a certain task over another when
> > its memory consumption goes beyond what is expected. With this interface,
> > I can easily define when an application should be oom killed because it is
> > using far more memory than expected. I can also disable oom killing
> > completely for it, if necessary. Unless you have a consistent baseline
> > for all tasks, the adjustment wouldn't contextually make any sense. Using
> > rss does not allow users to statically define when a task is rogue and is
> > dependent on the current state of memory at the time of oom.
> >
> > I would support removing most of the other heuristics other than the
> > baseline and the nodes intersection with mems_allowed to prefer tasks in
> > the same cpuset, though, to make it easier to understand and tune.
>
> What you are describing sounds like oom_adj simply doesn't fit your use
> case; you probably need a new /proc/{pid}/oom_priority knob instead. What
> you want is a job-severity-based OOM killing order, and severity doesn't
> depend on any heuristic: the server administrator should already know the
> severity of the jobs on his system.
>

That's the complete opposite of what I wrote above, we use oom_adj to
define when a user application is considered "rogue," meaning that it is
using far more memory than expected and so we want it killed. As you
mentioned weeks ago, the kernel cannot identify a memory leaker; this is
the user interface to allow the oom killer to identify a memory-hogging
rogue task that will (probably) consume all system memory eventually.
The way oom_adj is implemented, with a bit shift on a baseline of
total_vm, it can also polarize the badness heuristic to kill an
application based on priority by examining /proc/pid/oom_score, but that
wasn't my concern in this case. Using rss as a baseline reduces my
ability to tune oom_adj appropriately to identify those rogue tasks
because it is highly dynamic depending on the state of the VM at the time
of oom.
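
For reference, this is roughly how the oom_adj shift was applied on top of the
total_vm baseline in badness() at the time; the snippet is a paraphrased sketch
from memory, not a verbatim excerpt of any particular tree:

unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
	/* oom_adj == -17 (OOM_DISABLE) exempts the task entirely. */
	if (oom_adj == -17)
		return 0;
	if (oom_adj > 0) {
		if (!points)
			points = 1;
		points <<= oom_adj;	/* each positive step doubles the score */
	} else if (oom_adj < 0)
		points >>= -oom_adj;	/* each negative step halves the score */
	return points;
}

With a stable baseline like total_vm, those doublings and halvings correspond
to thresholds an administrator can work out in advance; with rss the same shift
lands on a value that moves with the state of the VM.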

2009-12-02 00:35:17

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

> On Tue, 1 Dec 2009, KOSAKI Motohiro wrote:
>
> > > The purpose of /proc/pid/oom_adj is not always to polarize the heuristic
> > > for the task it represents, it allows userspace to define when a task is
> > > rogue. Working with total_vm as a baseline, it is simple to use the
> > > interface to tune the heuristic to prefer a certain task over another when
> > > its memory consumption goes beyond what is expected. With this interface,
> > > I can easily define when an application should be oom killed because it is
> > > using far more memory than expected. I can also disable oom killing
> > > completely for it, if necessary. Unless you have a consistent baseline
> > > for all tasks, the adjustment wouldn't contextually make any sense. Using
> > > rss does not allow users to statically define when a task is rogue and is
> > > dependent on the current state of memory at the time of oom.
> > >
> > > I would support removing most of the other heuristics other than the
> > > baseline and the nodes intersection with mems_allowed to prefer tasks in
> > > the same cpuset, though, to make it easier to understand and tune.
> >
> > What you are describing sounds like oom_adj simply doesn't fit your use
> > case; you probably need a new /proc/{pid}/oom_priority knob instead. What
> > you want is a job-severity-based OOM killing order, and severity doesn't
> > depend on any heuristic: the server administrator should already know the
> > severity of the jobs on his system.
>
> That's the complete opposite of what I wrote above, we use oom_adj to
> define when a user application is considered "rogue," meaning that it is
> using far more memory than expected and so we want it killed. As you
> mentioned weeks ago, the kernel cannot identify a memory leaker; this is
> the user interface to allow the oom killer to identify a memory-hogging
> rogue task that will (probably) consume all system memory eventually.
> The way oom_adj is implemented, with a bit shift on a baseline of
> total_vm, it can also polarize the badness heuristic to kill an
> application based on priority by examining /proc/pid/oom_score, but that
> wasn't my concern in this case. Using rss as a baseline reduces my
> ability to tune oom_adj appropriately to identify those rogue tasks
> because it is highly dynamic depending on the state of the VM at the time
> of oom.

- I mean that you barely need the kernel heuristic at all, but desktop users
do need it.
- Every job scheduler provides a memory limitation feature, but the OOM killer
is not the place to implement memory limits; we have the memory cgroup for that.
- If you need a memory-usage-based knob, reading /proc/{pid}/statm and writing
/proc/{pid}/oom_priority would probably work well (rough sketch below).
- Unfortunately, we can't keep using VSZ-based heuristics, because modern
applications can have a VSZ ten times larger than their RSS consumption.
Nowadays VSZ is no longer a good approximation of RSS, and from the desktop
user's point of view there is no good reason to keep it.
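
A rough userspace sketch of the statm/oom_priority idea above. Note that
/proc/{pid}/oom_priority does not exist today; it is the knob being proposed
here, so the write path below is entirely hypothetical:

#include <stdio.h>

/* Check one task's RSS (field 2 of /proc/<pid>/statm, in pages) against a
 * per-job budget and, if exceeded, raise its priority via the proposed
 * (hypothetical) /proc/<pid>/oom_priority knob. */
static int check_and_mark(int pid, long rss_budget_pages, int prio)
{
	char path[64];
	long size, resident;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/statm", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld %ld", &size, &resident) != 2) {
		fclose(f);
		return -1;
	}
	fclose(f);

	if (resident <= rss_budget_pages)
		return 0;	/* within its budget, leave it alone */

	snprintf(path, sizeof(path), "/proc/%d/oom_priority", pid);
	f = fopen(path, "w");	/* hypothetical knob, not an existing file */
	if (!f)
		return -1;
	fprintf(f, "%d\n", prio);
	fclose(f);
	return 0;
}

A job scheduler could run this check from its normal polling loop for the jobs
it manages.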

IOW, the kernel heuristic should be tuned for the majority of users, and we
provide a knob to help the minority.

Or do you have any idea for detecting the difference between a typical desktop
and your use case?


2009-12-03 23:25:17

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Wed, 2 Dec 2009, KOSAKI Motohiro wrote:

> - I mean that you barely need the kernel heuristic at all, but desktop users
> do need it.

My point is that userspace needs to be able to identify memory leaking
tasks and polarize oom killing priorities. /proc/pid/oom_adj does a good
job of both with total_vm as a baseline.

> - Every job scheduler provides a memory limitation feature, but the OOM killer
> is not the place to implement memory limits; we have the memory cgroup for that.

Wrong: the oom killer implements cpuset memory limitations.
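
From memory, the cpuset awareness in badness() at the time looked roughly like
this (a sketch, not a verbatim excerpt):

#include <linux/cpuset.h>
#include <linux/sched.h>

/*
 * Candidates whose mems_allowed do not intersect the allocating task's are
 * heavily discounted, since killing them frees little or nothing on the
 * nodes we actually need.
 */
static unsigned long cpuset_discount(struct task_struct *p,
				     unsigned long points)
{
	if (!cpuset_mems_allowed_intersects(current, p))
		points /= 8;
	return points;
}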

> - If you need a memory-usage-based knob, reading /proc/{pid}/statm and writing
> /proc/{pid}/oom_priority would probably work well.

Constantly polling /proc/pid/statm and updating the oom killer priorities
at a constant interval is a ridiculous proposal for identifying memory
leakers, sorry.

> - Unfortunately, we can't keep using VSZ-based heuristics, because modern
> applications can have a VSZ ten times larger than their RSS consumption.
> Nowadays VSZ is no longer a good approximation of RSS, and from the desktop
> user's point of view there is no good reason to keep it.
>

Then leave the heuristic alone by default, so we don't lose any functionality
we once had, and add additional heuristics depending on the environment,
selected through a new tunable.

> IOW, the kernel heuristic should be tuned for the majority of users, and we
> provide a knob to help the minority.
>

Moving the baseline to rss severely undermines that knob: we lose a lot of
control over identifying memory leakers and polarizing oom killer priorities,
because the score now depends on the state of the VM at the time of oom, which
/proc/pid/oom_adj may not have recently been updated to reflect.

I don't know why you continuously invoke the same arguments to completely
change the baseline for the oom killer heuristic because you falsely
believe that killing the task with the largest memory resident in RAM is
more often than not the ideal task to kill. It's very frustrating when
you insist on changing the default heuristic based on your own belief that
people use Linux in the same way you do.

If Andrew pushes the patch to change the baseline to rss
(oom_kill-use-rss-instead-of-vm-size-for-badness.patch) to Linus, I'll
strongly nack it because you totally lack the ability to identify memory
leakers as defined by userspace which should be the prime target for the
oom killer. You have not addressed that problem, you've merely talked
around it, and yet the patch unbelievably still sits in -mm.

2009-12-04 00:47:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] oom_kill: use rss value instead of vm size for badness

On Thu, 3 Dec 2009 15:25:05 -0800 (PST)
David Rientjes <[email protected]> wrote:

> If Andrew pushes the patch to change the baseline to rss
> (oom_kill-use-rss-instead-of-vm-size-for-badness.patch) to Linus, I'll
> strongly nack it because you totally lack the ability to identify memory
> leakers as defined by userspace which should be the prime target for the
> oom killer. You have not addressed that problem, you've merely talked
> around it, and yet the patch unbelievably still sits in -mm.
>
The oom-kill patches are still cooking, and I'll ask Andrew not to send this
one when he asks about the mm merge plan. At the least, a per-mm swap counter
and a lowmem-rss counter are needed first. I'll rewrite the fork-bomb detector,
too.

As I have said repeatedly, calculating badness from vm_size _only_ is bad.
I'm not sure how google's magical applications work, but in general vm_size
does not reflect private memory usage, i.e. how much memory the oom-killer can
actually free by killing the task.
And the current oom-killer kills the wrong process. Please share your ideas for
making the oom-killer better rather than just saying "don't do that".

Do you have a good algorithm for detecting a memory-leaking process in user
land? I think I added something for this in my old patch set, but it's not
enough. I'll add more statistics to mm_struct so we can do a better job.

BTW, I hate oom_adj very much. Its bit-shift nature is hard to understand.
I wonder why a static oom priority or an oom_threshold was never implemented...
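
Just to illustrate the idea (purely hypothetical; neither oom_threshold nor
this helper exists anywhere), a static threshold on top of the rss baseline
could look like:

/*
 * Hypothetical sketch only: score a task by how far its RSS exceeds a
 * per-task threshold (in pages). Tasks under their declared budget are
 * never preferred victims.
 */
static unsigned long badness_with_threshold(struct mm_struct *mm,
					    unsigned long threshold_pages)
{
	unsigned long points;

	points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss);
	if (points <= threshold_pages)
		return 0;
	return points - threshold_pages;
}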

Thanks,
-Kame