2010-02-01 22:02:44

by Lubos Lunak

Subject: Improving OOM killer


Hello,

I'd like to suggest some changes to the OOM killer code that I believe
improve its behaviour - here they turn it from something completely useless and
harmful into something that works very well. I'm first posting changes for
review, I'll reformat the patch later as needed if approved. Given that it's
been like this for about a decade, I get the feeling there must be some
strange catch.

My scenario is working in a KDE desktop session and accidentally running
parallel make in doc/ subdirectory of sources of a KDE module. As I use
distributed compiling, I run make with -j20 or more, but, as the tool used for
processing KDE documentation is quite memory-intensive, running this many
of them is more than enough to consume all the 2GB RAM in the machine. What
happens in that case is that the machine becomes virtually unresponsive,
where even Ctrl+Alt+F1 can take minutes, not to mention some action that'd
actually redeem the situation. If I wait long enough for something to happen,
which can be even hours, the action that ends the situation is killing one of
the most vital KDE processes, rendering the whole session useless and making
me lose all unsaved data.

The process tree looks roughly like this:

init
|- kdeinit
| |- ksmserver
| | |- kwin
| |- <other>
|- konsole
|- make
|- sh
| |- meinproc4
|- sh
| |- meinproc4
|- <etc>

What happens is that OOM killer usually selects either ksmserver (KDE session
manager) or kdeinit (KDE master process that spawns most KDE processes). Note
that in either case OOM killer does not reach the point of killing the actual
offender - it will randomly kill in the tree under kdeinit until it decides
to kill ksmserver, which means terminating the desktop session. As konsole is
a KUniqueApplication, it forks into background and gets reparented to init,
thus getting away from the kdeinit subtree. Since the memory pressure is
distributed among several meinproc4 processes, the badness does not get
summed up in its make grandparent, as badness() does this only for direct
parents.

Each meinproc4 process still uses a considerable amount of memory, so one
could assume that the situation would be solved by simply killing them one by
one, but it is not so because of using what I consider a poor metric for
measuring memory usage - VmSize. VmSize, if I'm not mistaken, is the size of
the address space taken by the process, which in practice does not say much
about how much memory the process actually uses. For example, /proc/*/status
for one selected KDE process:

VmPeak: 534676 kB
VmSize: 528340 kB
VmLck: 0 kB
VmHWM: 73464 kB
VmRSS: 73388 kB
VmData: 142332 kB
VmStk: 92 kB
VmExe: 44 kB
VmLib: 91232 kB
VmPTE: 716 kB

And various excerpts from /proc/*/smaps for this process:
...
7f7b3f800000-7f7b40000000 rwxp 00000000 00:00 0
Size: 8192 kB
Rss: 16 kB
Referenced: 16 kB
...
7f7b40055000-7f7b44000000 ---p 00000000 00:00 0
Size: 65196 kB
Rss: 0 kB
Referenced: 0 kB
...
7f7b443cd000-7f7b445cd000 ---p 0001c000 08:01
790267 /usr/lib64/kde4/libnsplugin.so
Size: 2048 kB
Rss: 0 kB
Referenced: 0 kB
...
7f7b48300000-7f7b4927d000 rw-s 00000000 08:01
58690 /var/tmp/kdecache-seli/kpc/kde-icon-cache.data
Size: 15860 kB
Rss: 24 kB
Referenced: 24 kB

I assume the first one is stack, search me what the second and third ones are
(there appears to be one such mapping as the third one for each .so used),
the last one is a mapping of a large cache file that's nevertheless rarely
used extensively, and even then it's backed by a file. In other words, none of
this actually uses much real memory, yet right now it's the process that
would get killed for using about 70MB of memory, even though it's not the
offender. The offender scores only about 1/3 of its badness, even though it
uses almost double the amount of memory:

VmPeak: 266508 kB
VmSize: 266504 kB
VmLck: 0 kB
VmHWM: 118208 kB
VmRSS: 118208 kB
VmData: 98512 kB
VmStk: 84 kB
VmExe: 60 kB
VmLib: 48944 kB
VmPTE: 536 kB

And the offender is only 14th in the list of badness candidates. Speaking of
which, the following is quite useful for seeing all processes sorted by
badness:

ls /proc/*/oom_score | grep -v self | sed 's/\(.*\)\/\(.*\)/echo -n "\1 "; \
echo -n "`cat \1\/\2 2>\/dev\/null` "; readlink \1\/exe || echo/'| sh | \
sort -nr +1

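A simpler equivalent of that one-liner, for reference (just a sketch of the
same idea, using the modern sort key syntax):

for d in /proc/[0-9]*; do
  printf '%s %s %s\n' "$d" "$(cat "$d"/oom_score 2>/dev/null)" \
    "$(readlink "$d"/exe 2>/dev/null)"
done | sort -k2,2nr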

Therefore, I suggest doing the following changes in mm/oom_kill.c :

=====
--- linux-2.6.31/mm/oom_kill.c.sav 2010-02-01 22:00:41.614838540 +0100
+++ linux-2.6.31/mm/oom_kill.c 2010-02-01 22:01:08.773757932 +0100
@@ -69,7 +69,7 @@ unsigned long badness(struct task_struct
/*
* The memory size of the process is the basis for the badness.
*/
- points = mm->total_vm;
+ points = get_mm_rss(mm);

/*
* After this unlock we can no longer dereference local variable `mm'
@@ -83,21 +83,6 @@ unsigned long badness(struct task_struct
return ULONG_MAX;

/*
- * Processes which fork a lot of child processes are likely
- * a good choice. We add half the vmsize of the children if they
- * have an own mm. This prevents forking servers to flood the
- * machine with an endless amount of children. In case a single
- * child is eating the vast majority of memory, adding only half
- * to the parents will make the child our kill candidate of choice.
- */
- list_for_each_entry(child, &p->children, sibling) {
- task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
- task_unlock(child);
- }
-
- /*
* CPU time is in tens of seconds and run time is in thousands
* of seconds. There is no particular reason for this other than
* that it turned out to work very well in practice.
=====

In other words, use VmRSS for measuring memory usage instead of VmSize, and
remove the child accumulation.

I hope the above is good enough reason for the first change. VmSize includes
things like read-only mappings, memory mappings that are actually unused,
mappings backed by a file, mappings from video drivers, and so on. VmRSS is
actual real memory used, which is what mostly matters here. While it may not
be perfect, it is certainly an improvement.
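
For reference, get_mm_rss() in this kernel is simply the sum of the resident
file-backed and anonymous page counters (paraphrasing include/linux/mm.h of
this era):

#define get_mm_rss(mm) \
	(get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))

so the badness base would count only pages actually resident in RAM.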

The second change should be done on the basis that it does more harm than
good. In this specific case, it does not help to identify the source of the
problem, and it incorrectly identifies kdeinit as the problem solely on the
basis that it spawned many other processes. I think it's already quite hinted
that this is a problem by the fact that you had to add a special protection
for init - any session manager, process launcher or even xterm used for
launching apps is yet another init.

I also have problems finding a case where the child accounting would actually
help. I can certainly come up with something in theory, but in practice this
looks to me like a solution to a very synthetic problem. In which realistic
case will one process launch a limited number of children, all of which
consume memory, where just killing the children one by one won't reasonably
avoid the problem? This is unlikely to help against a forkbomb, as in that
case the number of children is the problem. It is not needed for the case of
just one child misbehaving and being restarted, nor will it work there. So
what is it supposed to fix, and is that more likely than the case of a
process launching several unrelated children?

If the child accounting is supposed to handle cases like the forked children
of Apache, then I suggest it be adjusted to count only children that have
been forked from the parent but have not called exec(). I'm afraid I don't
know how to detect that.


When running a kernel with these changes applied, I can safely do the
above-described case of running parallel doc generation in KDE. No clearly
innocent process is selected for killing; the first choice is always an
offender.

Moreover, the remedy is almost instant; there is only a fraction of a second
during which the machine is overloaded by the I/O of paging data in and out
(I do not use swap, but there is a large amount of memory used by read-only
mappings of binaries, libraries and various other files, and it is this that
renders the machine unresponsive in the original case - I assume this is
because the kernel tries to kill an innocent process, but the offenders
immediately consume anything that is freed, requiring even the memory holding
code that is about to be executed to be paged in from its files again).

I consider the patches to be definite improvements, so if they are ok, I will
format them as necessary. Now, what is the catch?

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]


2010-02-01 23:54:09

by David Rientjes

Subject: Re: Improving OOM killer

On Mon, 1 Feb 2010, Lubos Lunak wrote:

> Hello,
>
> I'd like to suggest some changes to the OOM killer code that I believe
> improve its behaviour - here they turn it from something completely useless and
> harmful into something that works very well. I'm first posting changes for
> review, I'll reformat the patch later as needed if approved. Given that it's
> been like this for about a decade, I get the feeling there must be some
> strange catch.
>

Thanks for your detailed explanation! It will be very helpful to have
your perspective and enthusiasm as we address oom killer issues.

I don't quite understand how you can say the oom killer is "completely
useless and harmful." It certainly fulfills its purpose, which is to kill
a memory hogging task so that a page allocation may succeed when reclaim
has failed to free any memory. It's useful in that regard, although I
don't think you'll find any argument against the claim that killing user
tasks can always be described as "harmful" in some way, at least to
somebody :)

> My scenario is working in a KDE desktop session and accidentally running
> parallel make in doc/ subdirectory of sources of a KDE module. As I use
> distributed compiling, I run make with -j20 or more, but, as the tool used for
> processing KDE documentation is quite memory-intensive, running this many
> of them is more than enough to consume all the 2GB RAM in the machine. What
> happens in that case is that the machine becomes virtually unresponsive,
> where even Ctrl+Alt+F1 can take minutes, not to mention some action that'd
> actually redeem the situation. If I wait long enough for something to happen,
> which can be even hours, the action that ends the situation is killing one of
> the most vital KDE processes, rendering the whole session useless and making
> me lose all unsaved data.
>

That's the point at which direct reclaim can no longer free any memory and
we must invoke the oom killer as opposed to potentially deadlocking the
system.

> The process tree looks roughly like this:
>
> init
> |- kdeinit
> | |- ksmserver
> | | |- kwin
> | |- <other>
> |- konsole
> |- make
> |- sh
> | |- meinproc4
> |- sh
> | |- meinproc4
> |- <etc>
>
> What happens is that OOM killer usually selects either ksmserver (KDE session
> manager) or kdeinit (KDE master process that spawns most KDE processes). Note
> that in either case OOM killer does not reach the point of killing the actual
> offender - it will randomly kill in the tree under kdeinit until it decides
> to kill ksmserver, which means terminating the desktop session. As konsole is
> a KUniqueApplication, it forks into background and gets reparented to init,
> thus getting away from the kdeinit subtree. Since the memory pressure is
> distributed among several meinproc4 processes, the badness does not get
> summed up in its make grandparent, as badness() does this only for direct
> parents.
>

There's no randomness involved in selecting a task to kill; the oom killer
does, however, try to kill a child with separate memory first instead of
the parent and that selection depends on the ordering of the selected
task's child list. As I mentioned very early this morning, that could
certainly be improved to kill the child with the highest badness() score
first. Something like this untested patch could do it:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -439,6 +439,9 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
const char *message)
{
struct task_struct *c;
+ struct task_struct *victim = NULL;
+ unsigned long victim_points = 0;
+ struct timespec uptime;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem);
@@ -455,14 +458,20 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
message, task_pid_nr(p), p->comm, points);

- /* Try to kill a child first */
+ /* Try to kill the most memory-hogging child first */
+ do_posix_clock_monotonic_gettime(&uptime);
list_for_each_entry(c, &p->children, sibling) {
+ unsigned long cpoints;
+
if (c->mm == p->mm)
continue;
- if (!oom_kill_task(c))
- return 0;
+ cpoints = badness(c, uptime.tv_sec);
+ if (cpoints > victim_points) {
+ victim = c;
+ victim_points = cpoints;
+ }
}
- return oom_kill_task(p);
+ return oom_kill_task(victim ? victim : p);
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR

This works because the conditions under which oom_kill_task(victim) fails
are !victim->mm or victim->signal->oom_adj == OOM_DISABLE, and we're
guarded against both since badness(victim) returns 0 in either case.

The process tree that you posted shows a textbook case for using
/proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is
protected from getting selected for oom kill. In your own words, this
"spawns most KDE processes," so it's an ideal place to set an oom_adj
value of OOM_DISABLE since that value is inheritable to children and,
thus, all children are implicitly protected as well.
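
For example, something along these lines at session startup would suffice
(illustrative only; -17 is OOM_DISABLE, writing it requires root or
CAP_SYS_RESOURCE, and the process name depends on what your kdeinit binary is
called):

	echo -17 > /proc/$(pidof kdeinit)/oom_adj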

> Each meinproc4 process still uses a considerable amount of memory, so one
> could assume that the situation would be solved by simply killing them one by
> one, but it is not so because of using what I consider a poor metric for
> measuring memory usage - VmSize. VmSize, if I'm not mistaken, is the size of
> the address space taken by the process, which in practice does not say much
> about how much memory the process actually uses. For example, /proc/*/status
> for one selected KDE process:
>
> VmPeak: 534676 kB
> VmSize: 528340 kB
> VmLck: 0 kB
> VmHWM: 73464 kB
> VmRSS: 73388 kB
> VmData: 142332 kB
> VmStk: 92 kB
> VmExe: 44 kB
> VmLib: 91232 kB
> VmPTE: 716 kB
>
> And various excerpts from /proc/*/smaps for this process:
> ...
> 7f7b3f800000-7f7b40000000 rwxp 00000000 00:00 0
> Size: 8192 kB
> Rss: 16 kB
> Referenced: 16 kB
> ...
> 7f7b40055000-7f7b44000000 ---p 00000000 00:00 0
> Size: 65196 kB
> Rss: 0 kB
> Referenced: 0 kB
> ...
> 7f7b443cd000-7f7b445cd000 ---p 0001c000 08:01
> 790267 /usr/lib64/kde4/libnsplugin.so
> Size: 2048 kB
> Rss: 0 kB
> Referenced: 0 kB
> ...
> 7f7b48300000-7f7b4927d000 rw-s 00000000 08:01
> 58690 /var/tmp/kdecache-seli/kpc/kde-icon-cache.data
> Size: 15860 kB
> Rss: 24 kB
> Referenced: 24 kB
>
> I assume the first one is stack, search me what the second and third ones are
> (there appears to be one such mapping as the third one for each .so used),
> the last one is a mapping of a large cache file that's nevertheless rarely
> used extensively, and even then it's backed by a file. In other words, none of
> this actually uses much real memory, yet right now it's the process that
> would get killed for using about 70MB of memory, even though it's not the
> offender. The offender scores only about 1/3 of its badness, even though it
> uses almost double the amount of memory:
>
> VmPeak: 266508 kB
> VmSize: 266504 kB
> VmLck: 0 kB
> VmHWM: 118208 kB
> VmRSS: 118208 kB
> VmData: 98512 kB
> VmStk: 84 kB
> VmExe: 60 kB
> VmLib: 48944 kB
> VmPTE: 536 kB
>

Using VmSize, however, allows us to define the most important task to kill
for the oom killer: memory leakers. Memory leakers are the single most
important tasks to identify with the oom killer and aren't obvious when
using rss because leaked memory does not stay resident in RAM. I
understand your system may not have such a leaker and it is simply
overcommitted on a 2GB machine, but using rss loses that ability. It also
makes tuning oom killer priorities with /proc/pid/oom_adj almost
impossible since a task's rss is highly dynamic and we cannot speculate on
the state of the VM at the time of oom.

> Therefore, I suggest doing the following changes in mm/oom_kill.c :
>
> =====
> --- linux-2.6.31/mm/oom_kill.c.sav 2010-02-01 22:00:41.614838540 +0100
> +++ linux-2.6.31/mm/oom_kill.c 2010-02-01 22:01:08.773757932 +0100
> @@ -69,7 +69,7 @@ unsigned long badness(struct task_struct
> /*
> * The memory size of the process is the basis for the badness.
> */
> - points = mm->total_vm;
> + points = get_mm_rss(mm);
>
> /*
> * After this unlock we can no longer dereference local variable `mm'
> @@ -83,21 +83,6 @@ unsigned long badness(struct task_struct
> return ULONG_MAX;
>
> /*
> - * Processes which fork a lot of child processes are likely
> - * a good choice. We add half the vmsize of the children if they
> - * have an own mm. This prevents forking servers to flood the
> - * machine with an endless amount of children. In case a single
> - * child is eating the vast majority of memory, adding only half
> - * to the parents will make the child our kill candidate of choice.
> - */
> - list_for_each_entry(child, &p->children, sibling) {
> - task_lock(child);
> - if (child->mm != mm && child->mm)
> - points += child->mm->total_vm/2 + 1;
> - task_unlock(child);
> - }
> -
> - /*
> * CPU time is in tens of seconds and run time is in thousands
> * of seconds. There is no particular reason for this other than
> * that it turned out to work very well in practice.
> =====
>
> In other words, use VmRSS for measuring memory usage instead of VmSize, and
> remove the child accumulation.
>
> I hope the above is good enough reason for the first change. VmSize includes
> things like read-only mappings, memory mappings that are actually unused,
> mappings backed by a file, mappings from video drivers, and so on. VmRSS is
> actual real memory used, which is what mostly matters here. While it may not
> be perfect, it is certainly an improvement.
>

It's not for a large number of users: the consumer of the largest amount
of rss is not necessarily the task we always want to kill. Just because
an order-0 page allocation fails does not mean we want to kill the task
that would free the largest amount of RAM.

I understand that KDE is extremely important to your work environment and
if you lose it, it seems like a failure of Linux and the VM. However, the
kernel cannot possibly know what applications you believe to be the most
important. For that reason, userspace is able to tune the badness() score
by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit. You
have the ability to protect KDE from getting oom killed, you just need to
use it.

The natural response to this is: well, you can't possibly expect all users
to tune /proc/pid/oom_adj just in case we are oom. I can expect KDE to
protect itself, though, since it is of sufficient importance to its users
that it should protect itself, as I expect the same of the other examples
that have arisen in recent months concerning Xorg.

There's a second caveat, however, to just protecting applications from the
oom killer using OOM_DISABLE. And that goes back to my original point of
being able to define memory leakers. /proc/pid/oom_adj actually works
quite well to define when a task is using far more memory than expected.
You can say that VmSize doesn't accurately reflect the amount of memory an
application is using, but it is both static enough and well enough
understood, from a pure arithmetic standpoint, to establish what a sane
range is for a particular application. It is then possible to use oom_adj
to define when that memory usage is egregious enough to be considered
"rogue" by the oom killer so that it warrants being killed.
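
To make the arithmetic concrete, oom_adj is applied as a bitshift on the
final score (paraphrasing this era's badness()):

	if (oom_adj > 0) {
		if (!points)
			points = 1;
		points <<= oom_adj;	/* each positive step doubles the score */
	} else
		points >>= -(oom_adj);	/* each negative step halves it */

That is easy to reason about when the baseline is a relatively static VmSize,
and much harder when it is a constantly changing rss.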

> The second change should be done on the basis that it does more harm than
> good. In this specific case, it does not help to identify the source of the
> problem, and it incorrectly identifies kdeinit as the problem solely on the
> basis that it spawned many other processes. I think it's already quite hinted
> that this is a problem by the fact that you had to add a special protection
> for init - any session manager, process launcher or even xterm used for
> launching apps is yet another init.
>

The kernel can't be expected to know that, you must tell it via
/proc/pid/oom_adj.

> I also have problems finding a case where the child accounting would actually
> help. I can certainly come up with something in theory, but in practice this
> looks to me like a solution to a very synthetic problem. In which realistic
> case will one process launch a limited number of children, all of which
> consume memory, where just killing the children one by one won't reasonably
> avoid the problem? This is unlikely to help against a forkbomb, as in that
> case the number of children is the problem. It is not needed for the case of
> just one child misbehaving and being restarted, nor will it work there. So
> what is it supposed to fix, and is that more likely than the case of a
> process launching several unrelated children?
>

Right, I believe Kame is working on a forkbomb detector that would replace
this logic.

> Moreover, the remedy is almost instant; there is only a fraction of a second
> during which the machine is overloaded by the I/O of paging data in and out
> (I do not use swap, but there is a large amount of memory used by read-only
> mappings of binaries, libraries and various other files, and it is this that
> renders the machine unresponsive in the original case - I assume this is
> because the kernel tries to kill an innocent process, but the offenders
> immediately consume anything that is freed, requiring even the memory holding
> code that is about to be executed to be paged in from its files again).
>

We may need to look at block congestion timeouts before invoking the oom
killer in the page allocator.

2010-02-02 21:10:13

by Lubos Lunak

Subject: Re: Improving OOM killer

On Tuesday 02 of February 2010, David Rientjes wrote:
> On Mon, 1 Feb 2010, Lubos Lunak wrote:
> > Hello,

> I don't quite understand how you can say the oom killer is "completely
> useless and harmful." It certainly fulfills its purpose, which is to kill
> a memory hogging task so that a page allocation may succeed when reclaim
> has failed to free any memory.

I started the sentence with "here". And if you compare my description of what
happens with the 5 goals listed in oom_kill.c, you can see that it fails all
of them except for 2, and even in that case it is much better to simply
reboot the computer. So that is why OOM killer _here_ is completely useless
and harmful, as even panic_on_oom does a better job.

I'm not saying it's so everywhere - presumably it works somewhere, since
somebody has written it this way - but since some of the design decisions
appear to be rather poor for desktop systems, "here" is probably not really
limited only to my computer either.

> > The process tree looks roughly like this:
> >
> > init
> > |- kdeinit
> > | |- ksmserver
> > | | |- kwin
> > | |- <other>
> > |- konsole
> > |- make
> > |- sh
> > | |- meinproc4
> > |- sh
> > | |- meinproc4
> > |- <etc>
> >
> > What happens is that OOM killer usually selects either ksmserver (KDE
> > session manager) or kdeinit (KDE master process that spawns most KDE
> > processes). Note that in either case OOM killer does not reach the point
> > of killing the actual offender - it will randomly kill in the tree under
> > kdeinit until it decides to kill ksmserver, which means terminating the
> > desktop session. As konsole is a KUniqueApplication, it forks into
> > background and gets reparented to init, thus getting away from the
> > kdeinit subtree. Since the memory pressure is distributed among several
> > meinproc4 processes, the badness does not get summed up in its make
> > grandparent, as badness() does this only for direct parents.
>
> There's no randomness involved in selecting a task to kill;

That was rather a figure of speech, but even if you want to take it
literally, then from the user's point of view it is random. Badness of
kdeinit depends on the number of children it has spawned, badness of
ksmserver depends for example on the number and size of windows open (as its
child kwin is a window and compositing manager).

Not that it really matters - the net result is that OOM killer usually
decides to kill kdeinit or ksmserver, starts killing their children, vital
KDE processes, and since the offenders are not among them, it ends up either
terminating the whole session by killing ksmserver or killing enough vital
processes there to free enough memory for the offenders to finish their work
cleanly.

> The process tree that you posted shows a textbook case for using
> /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is
> protected from getting selected for oom kill. In your own words, this
> "spawns most KDE processes," so it's an ideal place to set an oom_adj
> value of OOM_DISABLE since that value is inheritable to children and,
> thus, all children are implicitly protected as well.

Yes, it's a textbook case; sadly, textbook cases are theory and not practice.
I didn't mention it in my first mail to keep it shorter, but we have actually
tried it. First of all, it's rather cumbersome - as it requires root
privileges, one wrapper is needed for setuid and another to avoid setuid
side-effects; moreover, the setuid root process needs to stay running and
unset the protection on all children, or it'd be useless again.

Worse, it worked for about a year or two and now it has only shifted the
problem elsewhere and that's it. We now protect kdeinit, which means the OOM
killer's choice will then very likely be ksmserver. Ok, so let's say now we
start protecting also ksmserver, that's some additional hassle setting it up,
but that's doable. Now there's a good chance the OOM killer's choice will be
kwin (as a compositing manager it can have quite large mappings because of
graphics drivers). So ok, we need to protect the window manager, but since
that's not a hardcoded component like ksmserver, that's even more hassle.

And, after all that, the OOM killer will simply select yet another innocent
process. I didn't mention it before, but the memory statistics I presented for
one selected KDE process in my original mail were actually for an ordinary KDE
application - Konqueror showing a web page. Yet, as you can read in my
original mail, even though it used only about half the memory the offender
used, it still scored almost triple the offender's badness score.

So unless you're suggesting I implement my own dynamic badness handling in
userspace - which I hope we can all agree is nonsense - oom_adj is a
cumbersome non-solution here. It may work in some setups, but it doesn't for
the desktop.

> Using VmSize, however, allows us to define the most important task to kill
> for the oom killer: memory leakers. Memory leakers are the single most
> important tasks to identify with the oom killer and aren't obvious when
> using rss because leaked memory does not stay resident in RAM. I
> understand your system may not have such a leaker and it is simply
> overcommitted on a 2GB machine, but using rss loses that ability.

Interesting point. Am I getting it right that you're saying that VmRSS is
unsuitable because badness should take into account not only the RAM used by
the process but also the swap space used by the process? If yes, then this
rather raises the question of why the badness calculation doesn't do that
and uses VmSize instead?

I mean, as already demonstrated in the original mail, VmSize clearly can be
very wrong as a representation of memory used. I would actually argue that
VmRSS is still better, as the leaker would eventually fill the swap and start
taking up RAM, but either way, how about this then?

- points = mm->total_vm;
+ points = get_mm_rss(mm) +
get_mm_space_used_in_swap_but_not_in_other_places_like_file_backing(mm);

(I don't know if there's a function doing the latter or how to count it.
Probably not exactly trivial, given my experience with /proc/*/stat*-based
tools like top reporting rather wrong numbers for the swap usage of
processes.)
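
For illustration only: if the kernel kept a per-mm counter of swapped-out
pages - the counter name below is made up - the calculation could simply be:

	points = get_mm_rss(mm) + get_mm_counter(mm, swap_usage);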

> It also makes tuning oom killer priorities with /proc/pid/oom_adj almost
> impossible since a task's rss is highly dynamic and we cannot speculate on
> the state of the VM at the time of oom.

I see. However, using VmRSS and swap space together avoids this.

> > In other words, use VmRSS for measuring memory usage instead of VmSize,
> > and remove the child accumulation.
> >
> > I hope the above is good enough reason for the first change. VmSize
> > includes things like read-only mappings, memory mappings that are actually
> > unused, mappings backed by a file, mappings from video drivers, and so
> > on. VmRSS is actual real memory used, which is what mostly matters here.
> > While it may not be perfect, it is certainly an improvement.
>
> It's not for a large number of users,

You mean, besides all desktop users? In my experience desktop machines start
getting rather useless when their swap usage starts nearing the total RAM
amount, so swap should not be that significant. Moreover, again, it's still
better than VmSize, which can be wildly inaccurate. On my desktop system it
definitely is.

Hmm, maybe you're thinking of a server setup and that's different, I don't know.
Does the kernel have any "desktop mode"? I wouldn't mind if VmSize was used
on servers if you insist it is better, but on desktop VmSize is just plain
wrong. And, again, I think VmRSS+InSwap is better than either.

> the consumer of the largest amount
> of rss is not necessarily the task we always want to kill. Just because
> an order-0 page allocation fails does not mean we want to kill the task
> that would free the largest amount of RAM.

It's still much better than killing the task that would free the largest
amount of address space. And I cannot think of any better metric than
VmRSS+InSwap. Can you?

> I understand that KDE is extremely important to your work environment and
> if you lose it, it seems like a failure of Linux and the VM. However, the
> kernel cannot possibly know what applications you believe to be the most
> important. For that reason, userspace is able to tune the badness() score
> by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit. You
> have the ability to protect KDE from getting oom killed, you just need to
> use it.

As already explained, I can't. Besides, I'm not expecting a miracle; I simply
expect the kernel to kill the process that takes up the most memory, and the
kernel can know that - it just doesn't do it. What more evidence do you
need than the badness calculated for two processes differing from their
actual memory usage by a factor of 5 or more?

[snipped description of how oom_adj should help when it in fact wouldn't]

> > I also have problems finding a case where the child accounting would
> > actually help. I can certainly come up with something in theory, but in
> > practice this looks to me like a solution to a very synthetic problem. In
> > which realistic case will one process launch a limited number of children,
> > all of which consume memory, where just killing the children one by one
> > won't reasonably avoid the problem? This is unlikely to help against a
> > forkbomb, as in that case the number of children is the problem. It is not
> > needed for the case of just one child misbehaving and being restarted, nor
> > will it work there. So what is it supposed to fix, and is that more likely
> > than the case of a process launching several unrelated children?
>
> Right, I believe Kame is working on a forkbomb detector that would replace
> this logic.

Until then, can we dump the current code? Because I have provided one case
where it makes things worse and nobody has provided any case where it makes
things better or any other justification for its existence. There's no point
in keeping code for which nobody knows how it improves things (in reality,
not some textbook case).

And, in case the justification for it is something like "Apache", can we
fast-forward to my improved suggestion to limit this only to children that
are forked but not exec()-ed?

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-03 01:41:57

by David Rientjes

Subject: Re: Improving OOM killer

On Tue, 2 Feb 2010, Lubos Lunak wrote:

> > > init
> > > |- kdeinit
> > > | |- ksmserver
> > > | | |- kwin
> > > | |- <other>
> > > |- konsole
> > > |- make
> > > |- sh
> > > | |- meinproc4
> > > |- sh
> > > | |- meinproc4
> > > |- <etc>
> > >
> > > What happens is that OOM killer usually selects either ksmserver (KDE
> > > session manager) or kdeinit (KDE master process that spawns most KDE
> > > processes). Note that in either case OOM killer does not reach the point
> > > of killing the actual offender - it will randomly kill in the tree under
> > > kdeinit until it decides to kill ksmserver, which means terminating the
> > > desktop session. As konsole is a KUniqueApplication, it forks into
> > > background and gets reparented to init, thus getting away from the
> > > kdeinit subtree. Since the memory pressure is distributed among several
> > > meinproc4 processes, the badness does not get summed up in its make
> > > grandparent, as badness() does this only for direct parents.
> >
> > There's no randomness involved in selecting a task to kill;
>
> That was rather a figure of speech, but even if you want to take it
> literally, then from the user's point of view it is random. Badness of
> kdeinit depends on the number of children it has spawned, badness of
> ksmserver depends for example on the number and size of windows open (as its
> child kwin is a window and compositing manager).
>

As I've mentioned, I believe Kame (now cc'd) is working on replacing the
heuristic that adds the VM size for children into the parent task's
badness score with a forkbomb detector. That should certainly help to
reduce oom killing for parents whose children consume a lot of memory,
especially in your scenario where kdeinit is responsible for forking most
KDE processes.

> Not that it really matters - the net result is that OOM killer usually
> decides to kill kdeinit or ksmserver, starts killing their children, vital
> KDE processes, and since the offenders are not among them, it ends up either
> terminating the whole session by killing ksmserver or killing enough vital
> processes there to free enough memory for the offenders to finish their work
> cleanly.
>

The kernel cannot possibly know what you consider a "vital" process; for
that understanding you need to tell it using the very powerful
/proc/pid/oom_adj tunable. I suspect that if you were to protect all of
kdeinit's children by patching it to be OOM_DISABLE, so that all threads it
forks will inherit that value, you'd actually see much improved behavior.
I'd also encourage you to talk to the KDE developers to ensure that proper
precautions are taken to protect it in such conditions since people who
use such desktop environments typically don't want them to be sacrificed
for memory. The kernel, however, can only provide a mechanism for users
to define what they believe to be critical since it will vary widely
depending on how Linux is used for desktops, servers, and embedded
devices. Our servers, for example, have vital threads that are running on
each machine and they need to be protected in the same way. They set
themselves to be OOM_DISABLE.

On a side note, it's only possible to lower an oom_adj value for a thread
if you have CAP_SYS_RESOURCE. So that would be a prerequisite to setting
OOM_DISABLE for any thread. If that's not feasible, there's a workaround:
user tasks can always increase their own oom_adj value so that they are
always preferred in oom conditions. This protects the vital tasks, since
the volunteers preempt them as oom kill candidates, without requiring any
special capability.
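
A non-vital task can volunteer itself at startup without any privileges, for
example (a sketch; 15 is OOM_ADJUST_MAX):

	/* needs <fcntl.h> and <unistd.h> */
	int fd = open("/proc/self/oom_adj", O_WRONLY);
	if (fd >= 0) {
		write(fd, "15", 2);	/* make us the preferred oom victim */
		close(fd);
	}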

> > The process tree that you posted shows a textbook case for using
> > /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is
> > protected from getting selected for oom kill. In your own words, this
> > "spawns most KDE processes," so it's an ideal place to set an oom_adj
> > value of OOM_DISABLE since that value is inheritable to children and,
> > thus, all children are implicitly protected as well.
>
> Yes, it's a textbook case; sadly, textbook cases are theory and not practice.
> I didn't mention it in my first mail to keep it shorter, but we have actually
> tried it. First of all, it's rather cumbersome - as it requires root
> privileges, one wrapper is needed for setuid and another to avoid setuid
> side-effects; moreover, the setuid root process needs to stay running and
> unset the protection on all children, or it'd be useless again.
>

It only requires CAP_SYS_RESOURCE, actually, and your apparent difficulty
in writing to a kernel tunable is outside the scope of this discussion,
unfortunately. The bottom line is that the kernel provides a very well
defined interface for tuning the oom killer's badness heuristic so that
userspace can define what is "vital," whether that is kdeinit, ksmserver,
or any other thread. The choice needs to be made to use it, however.

> Worse, it worked for about a year or two and now it has only shifted the
> problem elsewhere and that's it. We now protect kdeinit, which means the OOM
> killer's choice will then very likely be ksmserver. Ok, so let's say now we
> start protecting also ksmserver, that's some additional hassle setting it up,
> but that's doable. Now there's a good chance the OOM killer's choice will be
> kwin (as a compositing manager it can have quite large mappings because of
> graphics drivers). So ok, we need to protect the window manager, but since
> that's not a hardcoded component like ksmserver, that's even more hassle.
>

No, you don't need to protect every KDE process from the oom killer unless
it is going to be a contender for selection. You could certainly do so
for completeness, but it shouldn't be required unless the nature of the
thread demands it: either it forks many vital tasks (kdeinit), or its
first-generation children's memory consumption can't be known, because it
depends on how many children it forks or because their memory consumption
is influenced by the user's work.

> > Using VmSize, however, allows us to define the most important task to kill
> > for the oom killer: memory leakers. Memory leakers are the single most
> > important tasks to identify with the oom killer and aren't obvious when
> > using rss because leaked memory does not stay resident in RAM. I
> > understand your system may not have such a leaker and it is simply
> > overcommitted on a 2GB machine, but using rss loses that ability.
>
> Interesting point. Am I getting it right that you're saying that VmRSS is
> unsuitable because badness should take into account not only the RAM used by
> the process but also the swap space used by the process? If yes, then this
> rather raises the question of why the badness calculation doesn't do that
> and uses VmSize instead?
>

We don't want to discount VM_RESERVED because the memory it represents
cannot be swapped, which is also an important indicator of the overall
memory usage of any particular application. I understand that your
particular use case may benefit from waiting on block congestion during
the page allocation that triggered the oom killer without making any
progress in direct reclaim, but discounting VM_RESERVED because of its
(sometimes erroneous) use with VM_IO isn't an appropriate indicator of the
thread's memory use since we're not considering the amount mapped for I/O.

> I mean, as already demonstrated in the original mail, VmSize clearly can be
> very wrong as a representation of memory used. I would actually argue that
> VmRSS is still better, as the leaker would eventually fill the swap and start
> taking up RAM, but either way, how about this then?
>

We simply can't afford to wait for a memory leaker to fill up all of swap
before the heuristic works to identify it; that would be extremely slow
depending on the speed of I/O, and it would actually increase the
probability of needlessly killing the wrong task, because the memory leaker
would fill up all of swap while other tasks got killed for having a large
rss. Such a heuristic works against finding the memory leaker, since the
leaker will always keep its rss low: leaked memory is never touched again
once it is leaked.

> Hmm, maybe you're thinking of a server setup and that's different, I don't know.
> Does the kernel have any "desktop mode"? I wouldn't mind if VmSize was used
> on servers if you insist it is better, but on desktop VmSize is just plain
> wrong. And, again, I think VmRSS+InSwap is better than either.
>

The heuristics are always well debated in this forum and there's little
chance that we'll ever settle on a single formula that works for all
possible use cases. That makes oom_adj even more vital to the overall
efficiency of the oom killer, I really hope you start to use it to your
advantage.

There are a couple of things that we'll always agree on: it's needless to
kill a task that has a different set of allowed nodes from the
triggering page allocation, for example, since it cannot access that
memory even if freed, and there should be some penalty for tasks that have
a shorter uptime, to break ties. I agree that most of the penalties as
currently implemented, however, aren't scientific and don't appropriately
adjust the score in those cases. Dividing the entire VM size by 2, for
example, because the task has a certain trait isn't an ideal formula.
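
For reference, the kinds of penalties being referred to look like this in the
current code (paraphrased and abridged from this era's badness()):

	if (task_nice(p) > 0)
		points *= 2;	/* niced tasks are assumed less important */
	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
		points /= 4;	/* superuser processes are protected */
	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
		points /= 4;	/* so are tasks with direct hardware access */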

There's also at least one thing that we'd both like to remove: the
factoring of each child's VM size into the badness score for the parent as
a way of detecting a forkbomb. Kame has been working on the forkbomb
detector specifically to address this issue, so we should stay tuned.

The one thing that I must stress, however, is the need for us to be able
to define what a "vital" task is and to define what a "memory leaker" is.
Both of those are currently possible with oom_adj, so we cannot lose this
functionality. Changing the baseline to be rss would be highly dynamic
and not allow us to accurately set an oom_adj value to define when a task
is rogue on a shared system.

That said, I don't think a complete rewrite of the badness function would
be a non-starter: we could easily make all tasks scale to have a badness
score ranging from 0 (never kill) to 1000 (always kill), for example. We
just need to propose a sane set of heuristics so that we don't lose the
configurability from userspace.
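
Purely as a sketch of the scale, not a concrete proposal, such a normalized
score could look like:

	/* badness in tenths of a percent of allowed memory, 0..1000 */
	points = get_mm_rss(mm) * 1000 / totalpages;

where totalpages would be the total RAM (and swap) available to the task.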

I've also disagreed before with always killing the most memory consuming
task whenever we have an oom. I agree with you that sometimes that task
may be the most vital to the system and setting it to be OOM_DISABLE
should not always be required for a simple order-0 allocation that fails
somewhere, especially if it's ~__GFP_NOFAIL. What I've recommended to you
would work with current mainline and at least kernels released within the
past few years, but I think there may be many more changes we can look to
make in the future.

> > the consumer of the largest amount
> > of rss is not necessarily the task we always want to kill. Just because
> > an order-0 page allocation fails does not mean we want to kill the task
> > that would free the largest amount of RAM.
>
> It's still much better than killing the task that would free the largest
> amount of address space. And I cannot think of any better metric than
> VmRSS+InSwap. Can you?
>

Yes, baselining on total_vm so that we understand the total amount of
memory that will no longer be mapped to that particular process if killed,
regardless of whether it's a shared library or not, so that we can define vital
tasks and memory leakers from userspace.

> > I understand that KDE is extremely important to your work environment and
> > if you lose it, it seems like a failure of Linux and the VM. However, the
> > kernel cannot possibly know what applications you believe to be the most
> > important. For that reason, userspace is able to tune the badness() score
> > by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit. You
> > have the ability to protect KDE from getting oom killed, you just need to
> > use it.
>
> As already explained, I can't. Besides, I'm not expecting a miracle; I simply
> expect the kernel to kill the process that takes up the most memory, and the
> kernel can know that - it just doesn't do it. What more evidence do you
> need than the badness calculated for two processes differing from their
> actual memory usage by a factor of 5 or more?
>

This is very likely related to the child VM sizes being accumulated in
the parent's badness score; we'll have to see how your results vary once
the forkbomb detector is merged. I disagree that we always need to kill
the application consuming the most memory, though; we need to determine
when it's better to simply fail a ~__GFP_NOFAIL allocation and when
killing a smaller, lower priority task may be more beneficial to the
user's work.

> > Right, I believe Kame is working on a forkbomb detector that would replace
> > this logic.
>
> Until then, can we dump the current code? Because I have provided one case
> where it makes things worse and nobody has provided any case where it makes
> things better or any other justification for its existence. There's no point
> in keeping code for which nobody knows how it improves things (in reality,
> not some textbook case).
>

First of all, I don't think the forkbomb detector is that far in the
future since a preliminary version of it was also posted, but I think we
also need a way to address those particular cases in the heuristic. There
are real-life cases where out of memory conditions occur specifically
because of forkbombs, so not addressing the issue in the heuristic in some
way is not appropriate.

> And, in case the justification for it is something like "Apache", can we
> fast-forward to my improved suggestion to limit this only to children that
> are forked but not exec()-ed?
>

The amount of memory you'll be freeing by simply killing such tasks will
be minimal; I don't think that's an appropriate heuristic if they all
share their memory with the parent.

2010-02-03 01:56:11

by Kamezawa Hiroyuki

Subject: Re: Improving OOM killer

On Tue, 2 Feb 2010 17:41:41 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Tue, 2 Feb 2010, Lubos Lunak wrote:
>
> > > > init
> > > > |- kdeinit
> > > > | |- ksmserver
> > > > | | |- kwin
> > > > | |- <other>
> > > > |- konsole
> > > > |- make
> > > > |- sh
> > > > | |- meinproc4
> > > > |- sh
> > > > | |- meinproc4
> > > > |- <etc>
> > > >
> > > > What happens is that OOM killer usually selects either ksmserver (KDE
> > > > session manager) or kdeinit (KDE master process that spawns most KDE
> > > > processes). Note that in either case OOM killer does not reach the point
> > > > of killing the actual offender - it will randomly kill in the tree under
> > > > kdeinit until it decides to kill ksmserver, which means terminating the
> > > > desktop session. As konsole is a KUniqueApplication, it forks into
> > > > background and gets reparented to init, thus getting away from the
> > > > kdeinit subtree. Since the memory pressure is distributed among several
> > > > meinproc4 processes, the badness does not get summed up in its make
> > > > grandparent, as badness() does this only for direct parents.
> > >
> > > There's no randomness involved in selecting a task to kill;
> >
> > That was rather a figure of speech, but even if you want to take it
> > literally, then from the user's point of view it is random. Badness of
> > kdeinit depends on the number of children it has spawned, badness of
> > ksmserver depends for example on the number and size of windows open (as its
> > child kwin is a window and compositing manager).
> >
>
> As I've mentioned, I believe Kame (now cc'd) is working on replacing the
> heuristic that adds the VM size for children into the parent task's
> badness score with a forkbomb detector.

I stopped that as I mentioned. I'm heavily disappointed with myself and
would like not to touch oom-killer things for a while.

I'd like to concentrate on memcg for a while, which I've starved for these past 3 months.

Then, you don't need to CC me.

Bye,
-Kame

2010-02-03 02:12:49

by David Rientjes

Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, KAMEZAWA Hiroyuki wrote:

> I stopped that as I mentioned. I'm heavily disappointed with myself and
> would like not to touch oom-killer things for a while.
>
> I'd like to concentrate on memcg for a while, which I've starved for these past 3 months.
>
> Then, you don't need to CC me.
>

I'm disappointed to hear that; it would be helpful to get some consensus
on the points that we can all agree on from different perspectives. I'll
try to find some time to develop a solution that people will see as a move
in the positive direction.

On a separate topic, do you have time to repost your sysctl cleanup for
Andrew from http://marc.info/?l=linux-kernel&m=126457416729672 with the
#ifndef CONFIG_MMU fix I mentioned?

2010-02-03 02:15:50

by Kamezawa Hiroyuki

Subject: Re: Improving OOM killer

On Tue, 2 Feb 2010 18:12:41 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Wed, 3 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > I stopped that as I mentioned. I'm heavily disappointed with myself and
> > would like not to touch oom-killer things for a while.
> >
> > I'd like to concentrate on memcg for a while, which I've starved for these past 3 months.
> >
> > Then, you don't need to CC me.
> >
>
> I'm disappointed to hear that, it would be helpful to get some consensus
> on the points that we can all agree on from different perspectives. I'll
> try to find some time to develop a solution that people will see as a move
> in the positive direction.
>
> On a separate topic, do you have time to repost your sysctl cleanup for
> Andrew from http://marc.info/?l=linux-kernel&m=126457416729672 with the
> #ifndef CONFIG_MMU fix I mentioned?
>
I'll not. Feel free to reuse fragments I posted.

Thanks,
-Kame

2010-02-03 02:36:50

by David Rientjes

Subject: [patch] sysctl: clean up vm related variable declarations

From: KAMEZAWA Hiroyuki <[email protected]>

There are many "extern" declarations in kernel/sysctl.c. "extern"
declarations in *.c files are not appreciated in general, and there
also seem to be a few redundant declarations.

Since most sysctl variables are defined in their own subsystem files,
they should be declared in a consistent style, each in its own *.h file.

This patch removes some VM (memory management) related sysctl variable
declarations from kernel/sysctl.c and moves them to the proper places.

[[email protected]: #ifdef fixlet]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
include/linux/mm.h | 5 +++++
include/linux/mmzone.h | 1 +
include/linux/oom.h | 5 +++++
kernel/sysctl.c | 16 ++--------------
mm/mmap.c | 5 +++++
5 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1305,6 +1305,7 @@ int in_gate_area_no_task(unsigned long addr);
#define in_gate_area(task, addr) ({(void)task; in_gate_area_no_task(addr);})
#endif /* __HAVE_ARCH_GATE_AREA */

+extern int sysctl_drop_caches;
int drop_caches_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
@@ -1347,5 +1348,9 @@ extern void shake_page(struct page *p, int access);
extern atomic_long_t mce_bad_pages;
extern int soft_offline_page(struct page *page, int flags);

+#ifndef CONFIG_MMU
+extern int sysctl_nr_trim_pages;
+#endif
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -760,6 +760,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+extern int percpu_pagelist_fraction; /* for sysctl */
int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -43,5 +43,10 @@ static inline void oom_killer_enable(void)
{
oom_killer_disabled = false;
}
+/* for sysctl */
+extern int sysctl_panic_on_oom;
+extern int sysctl_oom_kill_allocating_task;
+extern int sysctl_oom_dump_tasks;
+
#endif /* __KERNEL__*/
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -50,6 +50,8 @@
#include <linux/ftrace.h>
#include <linux/slow-work.h>
#include <linux/perf_event.h>
+#include <linux/mman.h>
+#include <linux/oom.h>

#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -66,11 +68,6 @@
/* External variables not in a header file. */
extern int C_A_D;
extern int print_fatal_signals;
-extern int sysctl_overcommit_memory;
-extern int sysctl_overcommit_ratio;
-extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
@@ -79,14 +76,9 @@ extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
-extern int sysctl_drop_caches;
-extern int percpu_pagelist_fraction;
extern int compat_log;
extern int latencytop_enabled;
extern int sysctl_nr_open_min, sysctl_nr_open_max;
-#ifndef CONFIG_MMU
-extern int sysctl_nr_trim_pages;
-#endif
#ifdef CONFIG_RCU_TORTURE_TEST
extern int rcutorture_runnable;
#endif /* #ifdef CONFIG_RCU_TORTURE_TEST */
@@ -197,10 +189,6 @@ extern struct ctl_table inotify_table[];
extern struct ctl_table epoll_table[];
#endif

-#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
-int sysctl_legacy_va_layout;
-#endif
-
extern int prove_locking;
extern int lock_stat;

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -87,6 +87,11 @@ int sysctl_overcommit_ratio = 50; /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
struct percpu_counter vm_committed_as;

+#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
+/* Used by each architecture's private code and sysctl. */
+int sysctl_legacy_va_layout;
+#endif
+
/*
* Check that a process has enough memory to allocate a new virtual
* mapping. 0 means there is enough memory for the allocation to

2010-02-03 07:50:25

by KOSAKI Motohiro

Subject: Re: Improving OOM killer

> =====
> --- linux-2.6.31/mm/oom_kill.c.sav 2010-02-01 22:00:41.614838540 +0100
> +++ linux-2.6.31/mm/oom_kill.c 2010-02-01 22:01:08.773757932 +0100
> @@ -69,7 +69,7 @@ unsigned long badness(struct task_struct
> /*
> * The memory size of the process is the basis for the badness.
> */
> - points = mm->total_vm;
> + points = get_mm_rss(mm);
>
> /*
> * After this unlock we can no longer dereference local variable `mm'
> @@ -83,21 +83,6 @@ unsigned long badness(struct task_struct
> return ULONG_MAX;
>
> /*
> - * Processes which fork a lot of child processes are likely
> - * a good choice. We add half the vmsize of the children if they
> - * have an own mm. This prevents forking servers to flood the
> - * machine with an endless amount of children. In case a single
> - * child is eating the vast majority of memory, adding only half
> - * to the parents will make the child our kill candidate of choice.
> - */
> - list_for_each_entry(child, &p->children, sibling) {
> - task_lock(child);
> - if (child->mm != mm && child->mm)
> - points += child->mm->total_vm/2 + 1;
> - task_unlock(child);
> - }
> -
> - /*
> * CPU time is in tens of seconds and run time is in thousands
> * of seconds. There is no particular reason for this other than
> * that it turned out to work very well in practice.
> =====
>
> In other words, use VmRSS for measuring memory usage instead of VmSize, and
> remove the child accumulation.
>
> I hope the above is good enough reason for the first change. VmSize includes
> things like read-only mappings, memory mappings that are actually unused,
> mappings backed by a file, mappings from video drivers, and so on. VmRSS is
> actual real memory used, which is what mostly matters here. While it may not
> be perfect, it is certainly an improvement.
>
> The second change should be done on the basis that it does more harm than
> good. In this specific case, it does not help to identify the source of the
> problem, and it incorrectly identifies kdeinit as the problem solely on the
> basis that it spawned many other processes. I think it's already quite hinted
> that this is a problem by the fact that you had to add a special protection
> for init - any session manager, process launcher or even xterm used for
> launching apps is yet another init.
>
> I also have trouble finding a case where the child accounting would actually
> help in practice. I can certainly come up with something in theory, but this
> looks to me like a solution to a very synthetic problem. In which realistic
> case will one process launch a limited number of children, all of which
> consume memory, where just killing the children one by one won't reasonably
> avoid the problem? This is unlikely to stop a forkbomb, as in that case the
> number of children will be the problem. It is not needed for just one child
> misbehaving and being restarted, nor will it work there. So what is it
> supposed to fix, and is that more likely than the case of a process
> launching several unrelated children?
>
> If the children accounting is supposed to handle cases like forked children
> of Apache, then I suggest it is adjusted only to count children that have
> been forked from the parent but there has been no exec(). I'm afraid I don't
> know how to detect that.
>
>
> When running a kernel with these changes applied, I can safely do the
> above-described case of running parallel doc generation in KDE. No clearly
> innocent process is selected for killing, the first choice is always an
> offender.
>
> Moreover, the remedy is almost instant; there is only a fraction of a second
> during which the machine is overloaded by the I/O of paging in and out (I do
> not use swap, but there is a large amount of memory used by read-only
> mappings of binaries, libraries and various other files, which in the
> original case renders the machine unresponsive - I assume this is because
> the kernel tries to kill an innocent process, but the offenders immediately
> consume anything that is freed, requiring even memory holding code that is
> about to be executed to be read back in from files).
>
> I consider the patches to be definite improvements, so if they are ok, I will
> format them as necessary. Now, what is the catch?

Personally, I think your use case represents a typical desktop, and Linux
has to work well in typical desktop use cases. /proc/pid/oom_adj never fit
the desktop use case. In past discussions I've agreed with many people, but
I haven't reached agreement with David Rientjes about this topic.

If you want to merge this patch, you need to persuade him. I can't help you,
sorry.



2010-02-03 08:07:13

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch] sysctl: clean up vm related variable declarations

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, there are many "extern" declarations in kernel/sysctl.c. "extern"
> declarations in *.c files are not appreciated in general,
> and it seems there are a few redundant declarations.
>
> Because most sysctl variables are defined in their own header files,
> they should be declared in the same style, in their own *.h files.
>
> This patch removes some VM (memory management) related sysctl
> variable declarations from kernel/sysctl.c and moves them to
> their proper places.
>
> [[email protected]: #ifdef fixlet]
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>


2010-02-03 08:17:48

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch] sysctl: clean up vm related variable declarations

* David Rientjes <[email protected]> [2010-02-02 18:36:42]:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, there are many "extern" declarations in kernel/sysctl.c. "extern"
> declarations in *.c files are not appreciated in general,
> and it seems there are a few redundant declarations.
>
> Because most sysctl variables are defined in their own header files,
> they should be declared in the same style, in their own *.h files.
>
> This patch removes some VM (memory management) related sysctl
> variable declarations from kernel/sysctl.c and moves them to
> their proper places.
>
> [[email protected]: #ifdef fixlet]
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> include/linux/mm.h | 5 +++++
> include/linux/mmzone.h | 1 +
> include/linux/oom.h | 5 +++++
> kernel/sysctl.c | 16 ++--------------
> mm/mmap.c | 5 +++++
> 5 files changed, 18 insertions(+), 14 deletions(-)
>

Looks good to me

--
Balbir

2010-02-03 08:57:23

by Balbir Singh

[permalink] [raw]
Subject: Re: Improving OOM killer

* Lubos Lunak <[email protected]> [2010-02-01 23:02:37]:

> [...]
> In other words, use VmRSS for measuring memory usage instead of VmSize, and
> remove child accumulating.

I am not sure of the impact of changing to RSS; although I've
personally believed that RSS-based accounting is where we should go,
we need to consider the following:

1. Total VM provides data about potentially swapped pages, overcommit,
   etc.
2. RSS alone is not sufficient; RSS does not account for shared pages,
   so we ideally need something like PSS.

You could potentially use the memory resource controller for
isolation, but that is a way of working around the problem by
partitioning the system; there are other workarounds like oom_adj, etc.

I suspect the correct answer would depend on our answers to 1 and 2
and a lot of testing with any changes made.

--
Balbir

2010-02-03 09:41:12

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, KOSAKI Motohiro wrote:

> Personally, I think your use case represents a typical desktop, and Linux
> has to work well in typical desktop use cases. /proc/pid/oom_adj never fit
> the desktop use case. In past discussions I've agreed with many people, but
> I haven't reached agreement with David Rientjes about this topic.
>

Which point don't you agree with? I've agreed that the heuristic needs to
be changed, and since Kame has decided to abandon his oom killer work, I
said that I would find time to develop a solution that would be based on
consensus. I don't think that simply replacing the baseline with rss,
rendering oom_adj practically useless for any purpose other than
polarizing priorities, and removing any penalty for tasks that fork an
egregious number of tasks is acceptable to all parties, though.

When a desktop system runs a task that, from the user's perspective, is so
vital that it cannot possibly be oom killed at any cost, such as KDE, is it
really that outrageous a request to set it to OOM_DISABLE? No, it's not.
There are plenty of open source examples of applications that tune their
own oom_adj values for that reason; userspace input into the oom killer's
heuristic will always be an integral part of its function.

I believe that we can reach consensus without losing the existing
functionality that oom_adj provides, namely defining vital system tasks
and memory leakers, and without this all or nothing type attitude that
insists we either go with rss as a baseline because "it doesn't select X
first in my particular example" or you'll just take your ball and go home.

2010-02-03 12:10:36

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wednesday 03 of February 2010, Balbir Singh wrote:
> * Lubos Lunak <[email protected]> [2010-02-01 23:02:37]:
> > In other words, use VmRSS for measuring memory usage instead of VmSize,
> > and remove child accumulating.
>
> I am not sure of the impact of changing to RSS, although I've
> personally believed that RSS based accounting is where we should go,
> but we need to consider the following
>
> 1. Total VM provides data about potentially swapped pages,

Yes, I've already updated my proposal in another mail to switch from VmSize
to VmRSS+InSwap. I don't know how to find out the second item in code, but
at this point of the discussion those are just details.

> overcommit,

I don't understand how this matters. Overcommit is memory for which address
space has been allocated but not actual memory, right? Then that's exactly
what I'm claiming is wrong and am trying to reverse. Currently OOM killer
takes this into account because it uses VmSize, but IMO it shouldn't - if a
process does malloc(400M) but then it uses only a tiny fraction of that, in
the case of memory shortage killing that process does not solve anything in
practice.
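
To make this concrete, a minimal userspace sketch (hypothetical, not from
any patch in this thread): it allocates 400MB of address space but touches
only 4MB, so /proc/self/status reports VmSize near 400MB while VmRSS stays
around 4MB - exactly the gap that makes VmSize misleading as a badness
baseline:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        size_t total = 400UL << 20;     /* 400MB of address space */
        size_t used = 4UL << 20;        /* only 4MB actually touched */
        char *p = malloc(total);

        if (!p)
                return 1;
        memset(p, 1, used);             /* fault in just the first 4MB */

        /* VmSize will show ~400MB, VmRSS only ~4MB */
        printf("check /proc/%d/status now\n", (int)getpid());
        pause();
        return 0;
}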

> etc.
> 2. RSS alone is not sufficient, RSS does not account for shared pages,
> so we ideally need something like PSS.

Just to make sure I understand what you mean by "RSS does not account for
shared pages" - you say that if a page is shared by 4 processes, then when
calculating badness for them, only 1/4 of the page should be counted for
each? Yes, I suppose so, that makes sense. That's more like fine-tuning at
this point though, as long as there's no agreement that moving away from
VmSize is an improvement.

> I suspect the correct answer would depend on our answers to 1 and 2
> and a lot of testing with any changes made.

Testing - are there actually any tests for it, or do people just test random
scenarios when they make changes? Also, I'm curious, what areas is the OOM
killer actually generally known to work well in? I somehow get the feeling
from the discussion here that people just tweak oom_adj until it works for
them.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-03 12:25:36

by Balbir Singh

[permalink] [raw]
Subject: Re: Improving OOM killer

* Lubos Lunak <[email protected]> [2010-02-03 13:10:27]:

> On Wednesday 03 of February 2010, Balbir Singh wrote:
> > * Lubos Lunak <[email protected]> [2010-02-01 23:02:37]:
> > > In other words, use VmRSS for measuring memory usage instead of VmSize,
> > > and remove child accumulating.
> >
> > I am not sure of the impact of changing to RSS, although I've
> > personally believed that RSS based accounting is where we should go,
> > but we need to consider the following
> >
> > 1. Total VM provides data about potentially swapped pages,
>
> Yes, I've already updated my proposal in another mail to switch from VmSize
> to VmRSS+InSwap. I don't know how to find out the second item in code, but
> at this point of the discussion those are just details.
>

I am yet to catch up with the rest of the thread. Thanks for heads up.

> > overcommit,
>
> I don't understand how this matters. Overcommit is memory for which address
> space has been allocated but not actual memory, right? Then that's exactly
> what I'm claiming is wrong and am trying to reverse. Currently OOM killer
> takes this into account because it uses VmSize, but IMO it shouldn't - if a
> process does malloc(400M) but then it uses only a tiny fraction of that, in
> the case of memory shortage killing that process does not solve anything in
> practice.

We have a way of tracking committed address space, which is more
sensible than just allocating memory and is used for tracking
overcommit. I was suggesting that that might be a better approach.
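
For reference, a sketch of how that counter is read (untested illustration
only, assuming the vm_committed_as declarations moved around by the sysctl
cleanup patch earlier in this thread):

#include <linux/mman.h>
#include <linux/percpu_counter.h>

static unsigned long total_committed_pages(void)
{
        /* system-wide committed address space, in pages */
        return percpu_counter_read_positive(&vm_committed_as);
}

Note it is a single system-wide counter, not a per-process statistic,
which limits its direct use inside badness().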

>
> > etc.
> > 2. RSS alone is not sufficient, RSS does not account for shared pages,
> > so we ideally need something like PSS.
>
> Just to make sure I understand what you mean by "RSS does not account for
> shared pages" - you say that if a page is shared by 4 processes, then when
> calculating badness for them, only 1/4 of the page should be counted for
> each? Yes, I suppose so, that makes sense.

Yes, that is what I am speaking of.

> That's more like fine-tuning at
> this point though, as long as there's no agreement that moving away from
> VmSize is an improvement.
>

There is no easy way to calculate the Pss today without walking the
page tables, but some simplification there would make it a better and
more accurate metric.
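
From userspace the proportional numbers are already exposed per mapping;
a small hypothetical C sketch (not from this thread) that sums the Pss:
fields of /proc/<pid>/smaps yields the per-process PSS being discussed:

#include <stdio.h>
#include <string.h>

long pss_kb(const char *pid)
{
        char path[64], line[256];
        long total = 0, val;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "Pss: %ld kB", &val) == 1)
                        total += val;   /* each mapping's share-adjusted size */
        fclose(f);
        return total;                   /* proportional set size in kB */
}

int main(int argc, char **argv)
{
        if (argc > 1)
                printf("%ld kB\n", pss_kb(argv[1]));
        return 0;
}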

> > I suspect the correct answer would depend on our answers to 1 and 2
> > and a lot of testing with any changes made.
>
> Testing - are there actually any tests for it, or do people just test random
> scenarios when they make changes? Also, I'm curious, what areas is the OOM
> killer actually generally known to work well in? I somehow get the feeling
> from the discussion here that people just tweak oom_adj until it works for
> them.
>

I've mostly found OOM killer to work well for me, but looking at the
design and our discussions I know there need to be certain improvements.

--
Balbir

2010-02-03 14:49:40

by Rik van Riel

[permalink] [raw]
Subject: Re: Improving OOM killer

On 02/01/2010 05:02 PM, Lubos Lunak wrote:

> In other words, use VmRSS for measuring memory usage instead of VmSize, and
> remove child accumulating.

I agree with removing the child accumulating code. That code can
do a lot of harm with databases like postgresql, or cause the
system's main service (eg. httpd) to be killed when a broken cgi
script used up too much memory.

IIRC the child accumulating code was introduced to deal with
malicious code (fork bombs), but it makes things worse for the
(much more common) situation of a system without malicious
code simply running out of memory due to being very busy.

I have no strong opinion on using RSS vs VmSize.

--
All rights reversed.

2010-02-03 15:01:12

by Minchan Kim

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 2010-02-03 at 17:55 +0530, Balbir Singh wrote:
> * Lubos Lunak <[email protected]> [2010-02-03 13:10:27]:
>
> > On Wednesday 03 of February 2010, Balbir Singh wrote:
> > > * Lubos Lunak <[email protected]> [2010-02-01 23:02:37]:
> > > > In other words, use VmRSS for measuring memory usage instead of VmSize,
> > > > and remove child accumulating.
> > >
> > > I am not sure of the impact of changing to RSS, although I've
> > > personally believed that RSS based accounting is where we should go,
> > > but we need to consider the following
> > >
> > > 1. Total VM provides data about potentially swapped pages,
> >
> > Yes, I've already updated my proposal in another mail to switch from VmSize
> > to VmRSS+InSwap. I don't know how to find out the second item in code, but
> > at this point of the discussion those are just details.
> >

We have a swap count via mm-count-swap-usage.patch by Kame in mmotm.

> I am yet to catch up with the rest of the thread. Thanks for heads up.
>
> > > overcommit,
> >
> > I don't understand how this matters. Overcommit is memory for which address
> > space has been allocated but not actual memory, right? Then that's exactly
> > what I'm claiming is wrong and am trying to reverse. Currently OOM killer
> > takes this into account because it uses VmSize, but IMO it shouldn't - if a
> > process does malloc(400M) but then it uses only a tiny fraction of that, in
> > the case of memory shortage killing that process does not solve anything in
> > practice.
>
> We have a way of tracking committed address space, which is more
> sensible than just allocating memory and is used for tracking
> overcommit. I was suggesting that that might be a better approach.

Yes, it does make sense. At least total_vm doesn't care about the
MAP_NORESERVE case. But unfortunately, it's per-CPU, not per-process.

>
> >
> > > etc.
> > > 2. RSS alone is not sufficient, RSS does not account for shared pages,
> > > so we ideally need something like PSS.
> >
> > Just to make sure I understand what you mean by "RSS does not account for
> > shared pages" - you say that if a page is shared by 4 processes, then when
> > calculating badness for them, only 1/4 of the page should be counted for
> > each? Yes, I suppose so, that makes sense.
>
> Yes, that is what I am speaking of.

I agree. If we want to make RSS with base of badness, it's one of things
we have to solve.


--
Kind regards,
Minchan Kim

2010-02-03 16:06:34

by Minchan Kim

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thu, 2010-02-04 at 00:00 +0900, Minchan Kim wrote:
> On Wed, 2010-02-03 at 17:55 +0530, Balbir Singh wrote:
> > * Lubos Lunak <[email protected]> [2010-02-03 13:10:27]:
> > > I don't understand how this matters. Overcommit is memory for which address
> > > space has been allocated but not actual memory, right? Then that's exactly
> > > what I'm claiming is wrong and am trying to reverse. Currently OOM killer
> > > takes this into account because it uses VmSize, but IMO it shouldn't - if a
> > > process does malloc(400M) but then it uses only a tiny fraction of that, in
> > > the case of memory shortage killing that process does not solve anything in
> > > practice.
> >
> > We have a way of tracking committed address space, which is more
> > sensible than just allocating memory and is used for tracking
> > overcommit. I was suggesting that that might be a better approach.
>
> Yes, it does make sense. At least total_vm doesn't care about the
> MAP_NORESERVE case. But unfortunately, it's per-CPU, not per-process.

Sorry for the confusion, I had it the other way around. I was sleepy :)
The committed AS doesn't care about the MAP_NORESERVE case,
but it definitely charges memory, so I think total_vm is better than
committed AS if we really have to keep using the vmsize heuristic.

But I am not sure that I understand your point about the overcommit policy.


--
Kind regards,
Minchan Kim

2010-02-03 17:01:35

by Balbir Singh

[permalink] [raw]
Subject: Re: Improving OOM killer

* Rik van Riel <[email protected]> [2010-02-03 09:49:18]:

> On 02/01/2010 05:02 PM, Lubos Lunak wrote:
>
> > In other words, use VmRSS for measuring memory usage instead of VmSize, and
> >remove child accumulating.
>
> I agree with removing the child accumulating code. That code can
> do a lot of harm with databases like postgresql, or cause the
> system's main service (eg. httpd) to be killed when a broken cgi
> script used up too much memory.
>
> IIRC the child accumulating code was introduced to deal with
> malicious code (fork bombs), but it makes things worse for the
> (much more common) situation of a system without malicious
> code simply running out of memory due to being very busy.
>

For fork bombs, we could test the number of children and require a
threshold to be crossed before we consider a process and its children
for badness().

> I have no strong opinion on using RSS vs VmSize.
>

David has commented and feels strongly about this; he prefers VmSize over RSS.

--
Balbir

2010-02-03 18:58:13

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, Balbir Singh wrote:

> > IIRC the child accumulating code was introduced to deal with
> > malicious code (fork bombs), but it makes things worse for the
> > (much more common) situation of a system without malicious
> > code simply running out of memory due to being very busy.
> >
>
> For fork bombs, we could test the number of children and require a
> threshold to be crossed before we consider a process and its children
> for badness().
>

Yes, we could look for the number of children with separate mms and then
penalize those threads that have forked an egregious number, say, 500
tasks. I think we should check for this threshold within the badness()
heuristic to identify such forkbombs and not limit it only to certain
applications.

My rewrite for the badness() heuristic is centered on the idea that scores
should range from 0 to 1000, 0 meaning "never kill this task" and 1000
meaning "kill this task first." The baseline for a thread, p, may be
something like this:

unsigned int badness(struct task_struct *p,
                     unsigned long totalram)
{
        struct task_struct *child;
        struct mm_struct *mm;
        int forkcount = 0;
        long points;

        task_lock(p);
        mm = p->mm;
        if (!mm) {
                task_unlock(p);
                return 0;
        }
        points = (get_mm_rss(mm) +
                  get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
                 totalram;
        task_unlock(p);

        list_for_each_entry(child, &p->children, sibling)
                /* No lock, child->mm won't be dereferenced */
                if (child->mm && child->mm != mm)
                        forkcount++;

        /* Forkbombs get penalized 10% of available RAM */
        if (forkcount > 500)
                points += 100;

        ...

        /*
         * /proc/pid/oom_adj ranges from -1000 to +1000 to either
         * completely disable oom killing or always prefer it.
         */
        points += p->signal->oom_adj;

        if (points < 0)
                return 0;
        return (points <= 1000) ? points : 1000;
}

static struct task_struct *select_bad_process(...,
                                              nodemask_t *nodemask)
{
        struct task_struct *p;
        unsigned long totalram = 0;
        int nid;

        for_each_node_mask(nid, *nodemask)
                totalram += NODE_DATA(nid)->node_present_pages;

        for_each_process(p) {
                unsigned int points;

                ...

                if (!nodes_intersects(p->mems_allowed, *nodemask))
                        continue;

                ...
                points = badness(p, totalram);
                ...
        }
        ...
}

In this example, /proc/pid/oom_adj now ranges from -1000 to +1000, with
OOM_DISABLE being -1000, to polarize tasks for oom killing or determine
when a task is leaking memory because it is using far more memory than it
should. The nodemask passed from the page allocator should be intersected
with current->mems_allowed within the oom killer; userspace is then fully
aware of what value is an egregious amount of RAM for a task to consume,
including information it knows about the task's cpuset or mempolicy. For
example, it would be very simple for a user to set an oom_adj of -500,
which means "we discount 50% of the task's allowed memory from being
considered in the heuristic" or +500, which means "we always allow all
other system/cpuset/mempolicy tasks to use at least 50% more allowed
memory than this one."
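
As a toy rendering of that arithmetic (illustration only, mirroring the
rough draft above, not kernel code): a task with 60% of its allowed RAM
resident gets a baseline of 600 points, and an oom_adj of -500 discounts
it to a final score of 100:

#include <stdio.h>

static long proposed_score(long rss_pages, long swap_pages,
                           long totalram_pages, int oom_adj)
{
        long points = (rss_pages + swap_pages) * 1000 / totalram_pages;

        points += oom_adj;              /* proposed range: -1000..+1000 */
        if (points < 0)
                return 0;
        return points <= 1000 ? points : 1000;
}

int main(void)
{
        /* 600 of 1000 pages resident, oom_adj -500 -> score 100 */
        printf("%ld\n", proposed_score(600, 0, 1000, -500));
        return 0;
}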

2010-02-03 19:43:45

by Frans Pop

[permalink] [raw]
Subject: Re: Improving OOM killer

David Rientjes wrote:
> /*
> * /proc/pid/oom_adj ranges from -1000 to +1000 to either
> * completely disable oom killing or always prefer it.
> */
> points += p->signal->oom_adj;
>

Wouldn't that cause a rather huge compatibility issue given that the
current oom_adj works in a totally different way:

! 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
! ------------------------------------------------------
! This file can be used to adjust the score used to select which processes
! should be killed in an out-of-memory situation. Giving it a high score
! will increase the likelihood of this process being killed by the
! oom-killer. Valid values are in the range -16 to +15, plus the special
! value -17, which disables oom-killing altogether for this process.

?

Cheers,
FJP

2010-02-03 19:52:35

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, Frans Pop wrote:

> > * /proc/pid/oom_adj ranges from -1000 to +1000 to either
> > * completely disable oom killing or always prefer it.
> > */
> > points += p->signal->oom_adj;
> >
>
> Wouldn't that cause a rather huge compatibility issue given that the
> current oom_adj works in a totally different way:
>
> ! 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
> ! ------------------------------------------------------
> ! This file can be used to adjust the score used to select which processes
> ! should be killed in an out-of-memory situation. Giving it a high score
> ! will increase the likelihood of this process being killed by the
> ! oom-killer. Valid values are in the range -16 to +15, plus the special
> ! value -17, which disables oom-killing altogether for this process.
>
> ?
>

I thought about whether we'd need an additional, complementary tunable
such as /proc/pid/oom_bias that would effect this new memory-charging bias
in the heuristic. It could be implemented so that writing to oom_adj
would clear oom_bias and vice versa.

Although that would certainly be possible, I didn't propose it for a
couple of reasons:

- it would clutter the space to have two separate tunables when the
  metric that /proc/pid/oom_adj uses has been made obsolete by the new
  baseline as a fraction of total RAM, and

- we have always exported OOM_DISABLE, OOM_ADJUST_MIN, and OOM_ADJUST_MAX
  via include/linux/oom.h so that userspace can use them sanely. Setting
  a particular oom_adj value for anything other than OOM_DISABLE means
  the score will be relative to other system tasks, so it's a value that
  is typically calibrated at runtime rather than a static, hardcoded
  one.

We could reuse /proc/pid/oom_adj for the new heuristic by severely
reducing its granularity compared to what it would otherwise have, doing
(oom_adj * 1000 / OOM_ADJUST_MAX), but that will eventually become
annoying and much more difficult to document.

Given your citation, I don't think we've ever described /proc/pid/oom_adj
outside of the implementation as a bitshift, either. So its use right now
for anything other than OOM_DISABLE is probably based on scalar thinking.

2010-02-03 20:12:37

by Frans Pop

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wednesday 03 February 2010, David Rientjes wrote:
> - we have always exported OOM_DISABLE, OOM_ADJUST_MIN, and
> OOM_ADJUST_MAX via include/oom.h so that userspace should use them
> sanely. Setting a particular oom_adj value for anything other than
> OOM_DISABLE means the score will be relative to other system tasks, so
> its a value that is typically calibrated at runtime rather than static,
> hardcoded values.

That doesn't take into account:
- applications where the oom_adj value is hardcoded to a specific value
(for whatever reason)
- sysadmin scripts that set oom_adj from the console

I would think that oom_adj is a documented part of the userspace ABI and
that the change you propose does not fit the normal backwards
compatibility requirements for exposed tunables.

I think that at least any user who's currently setting oom_adj to -17 has a
right to expect that to continue to mean "oom killer disabled". And for
any other value they should get a similar impact to the current impact,
and not one that's reduced by a factor of 66.
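
For illustration, the rescaling mentioned above, (oom_adj * 1000 /
OOM_ADJUST_MAX), maps each legacy unit to roughly 66 of the new points
(1000 / 15), which is where that factor comes from. A hypothetical sketch
(the constants match those in linux/oom.h):

#include <stdio.h>

#define OOM_DISABLE     (-17)
#define OOM_ADJUST_MIN  (-16)
#define OOM_ADJUST_MAX  15

static int rescale_oom_adj(int oom_adj)
{
        int points;

        if (oom_adj == OOM_DISABLE)
                return -1000;
        points = oom_adj * 1000 / OOM_ADJUST_MAX;
        /* clamp: OOM_ADJUST_MIN would otherwise map slightly past -1000 */
        return points < -1000 ? -1000 : points;
}

int main(void)
{
        int adj;

        /* steps of ~66: -16 -> -1000, 1 -> 66, 15 -> 1000 */
        for (adj = OOM_ADJUST_MIN; adj <= OOM_ADJUST_MAX; adj++)
                printf("%3d -> %5d\n", adj, rescale_oom_adj(adj));
        return 0;
}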

> We could reuse /proc/pid/oom_adj for the new heuristic by severely
> reducing its granularity than it otherwise would by doing
> (oom_adj * 1000 / OOM_ADJUST_MAX), but that will eventually become
> annoying and much more difficult to document.

Probably quite true, but maybe unavoidable if one accepts the above.

But I'll readily admit I'm not the final authority on this.

Cheers,
FJP

2010-02-03 20:26:28

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, Frans Pop wrote:

> That doesn't take into account:
> - applications where the oom_adj value is hardcoded to a specific value
> (for whatever reason)
> - sysadmin scripts that set oom_adj from the console
>

The fundamentals are the same: negative values mean the task is less
likely to be preferred and positive values mean the task is more likely,
only the scale is different. That scale is exported by the kernel via
OOM_ADJUST_MIN and OOM_ADJUST_MAX and has been since 2006. I don't think
we need to preserve legacy applications or scripts that use hardcoded
values without importing linux/oom.h.

> I would think that oom_adj is a documented part of the userspace ABI and
> that the change you propose does not fit the normal backwards
> compatibility requirements for exposed tunables.
>

The range is documented (but it should have been documented as being from
OOM_ADJUST_MIN to OOM_ADJUST_MAX) but its implementation as a bitshift is
not; it simply says that positive values mean the task is more preferred
and negative values mean it is less preferred. Those semantics are
preserved.

> I think that at least any user who's currently setting oom_adj to -17 has a
> right to expect that to continue to mean "oom killer disabled". And for
> any other value they should get a similar impact to the current impact,
> and not one that's reduced by a factor 66.
>

If the baseline changes, as we all agree it needs to, such that oom_adj no
longer represents the same thing it did in the first place (it would
become a linear bias), I think this breakage is actually beneficial.
Users will now be able to tune their oom_adj values based on a fraction of
system memory to bias their applications either preferably or otherwise.

I think we should look at Linux over the next couple of years and decide
if we want to be married to the current semantics of oom_adj that are
going to change (as it would require being a factor of 66, as you
mentioned) when the implementation it was designed for has vanished.

2010-02-03 21:22:19

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wednesday 03 of February 2010, Balbir Singh wrote:
> * Lubos Lunak <[email protected]> [2010-02-03 13:10:27]:
> > On Wednesday 03 of February 2010, Balbir Singh wrote:
> > > 2. RSS alone is not sufficient, RSS does not account for shared pages,
> > > so we ideally need something like PSS.
> >
> > Just to make sure I understand what you mean by "RSS does not account
> > for shared pages" - you say that if a page is shared by 4 processes, then
> > when calculating badness for them, only 1/4 of the page should be counted
> > for each? Yes, I suppose so, that makes sense.
>
> Yes, that is what I am speaking of
>
> > That's more like fine-tuning at
> > this point though, as long as there's no agreement that moving away from
> > VmSize is an improvement.
>
> There is no easy way to calculate the Pss today without walking the
> page tables, but some simplification there would make it a better and
> more accurate metric.

OOM should be a rare situation, so doing a small amount of counting
shouldn't be a big deal, especially if the machine is otherwise busy
waiting for the HDD to page things out and in again and has plenty of CPU
time to waste.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-03 22:55:06

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer


Given that the badness() proposal I see in another of your mails uses
get_mm_rss(), I take it that you've meanwhile changed your mind on the VmSize
vs VmRSS argument and consider that argument irrelevant now. I will comment
only on the suggested use of oom_adj on the desktop here. And actually I hope
that if something reasonably similar to your badness() proposal replaces the
current one it will make any use of oom_adj not needed on the desktop in the
usual case, so this may be irrelevant as well.

On Wednesday 03 of February 2010, David Rientjes wrote:
> On Tue, 2 Feb 2010, Lubos Lunak wrote:
> > Not that it really matters - the net result is that OOM killer usually
> > decides to kill kdeinit or ksmserver, starts killing their children,
> > vital KDE processes, and since the offenders are not among them, it ends
> > up either terminating the whole session by killing ksmserver or killing
> > enough vital processes there to free enough memory for the offenders to
> > finish their work cleanly.
>
> The kernel cannot possibly know what you consider a "vital" process, for
> that understanding you need to tell it using the very powerful
> /proc/pid/oom_adj tunable. I suspect if you were to protect all of
> kdeinit's children by patching it to be OOM_DISABLE so that all threads it
> forks will inherit that value you'd actually see much improved behavior.

No. Almost everything in KDE is spawned by kdeinit, so everything would get
the adjustment, which means nothing would in practice get the adjustment.

> I'd also encourage you to talk to the KDE developers to ensure that proper
> precautions are taken to protect it in such conditions since people who
> use such desktop environments typically don't want them to be sacrificed
> for memory.

I am a KDE developer, it's written in my signature. And I've already talked
enough to the KDE developer who has done the oom_adj code that's already
there, as that's also me. I don't know kernel internals, but that doesn't
mean I'm completely clueless about the topic of the discussion I've started.

> > Worse, it worked for about a year or two and now it has only shifted the
> > problem elsewhere and that's it. We now protect kdeinit, which means the
> > OOM killer's choice will very likely be ksmserver then. Ok, so let's say now
> > we start protecting also ksmserver, that's some additional hassle setting
> > it up, but that's doable. Now there's a good chance the OOM killer's
> > choice will be kwin (as a compositing manager it can have quite large
> > mappings because of graphics drivers). So ok, we need to protect the
> > window manager, but since that's not a hardcoded component like
> > ksmserver, that's even more hassle.
>
> No, you don't need to protect every KDE process from the oom killer unless
> it is going to be a contender for selection. You could certainly do so
> for completeness, but it shouldn't be required unless the nature of the
> thread demands it such that it forks many vital tasks (kdeinit) or its
> first-generation children's memory consumption can't be known either
> because it depends on how many children it can fork or their memory
> consumption is influenced by the user's work.

1) I think you missed that I said that with the current algorithm every KDE
application can potentially be a contender for selection, and I provided
numbers to demonstrate that in a selected case. Just because such application
is not vital does not mean it's good for it to get killed instead of an
obvious offender.

2) You probably do not realize the complexity involved in using oom_adj in a
desktop. Even when doing that manually I would have some difficulty finding
the right setup for my own desktop use. It'd probably be virtually impossible
to write code that would do it at least somewhat right with all the widely
differing desktop setups that change dynamically.

3) oom_adj is ultimately just a kludge to handle special cases where the
heuristic doesn't get it right for whatever strange reason. But even you
yourself in another mail presented a heuristic that I believe would make any
use of oom_adj on the desktop unnecessary in the usual cases. The usual
desktop is not a special case.

> The heuristics are always well debated in this forum and there's little
> chance that we'll ever settle on a single formula that works for all
> possible use cases. That makes oom_adj even more vital to the overall
> efficiency of the oom killer, I really hope you start to use it to your
> advantage.

I really hope your latest badness() heuristics proposal allows us to dump
even the oom_adj use we already have.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-03 22:55:10

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wednesday 03 of February 2010, David Rientjes wrote:
> My rewrite for the badness() heuristic is centered on the idea that scores
> should range from 0 to 1000, 0 meaning "never kill this task" and 1000
> meaning "kill this task first." The baseline for a thread, p, may be
> something like this:
>
> unsigned int badness(struct task_struct *p,
>                      unsigned long totalram)
> {
>         struct task_struct *child;
>         struct mm_struct *mm;
>         int forkcount = 0;
>         long points;
>
>         task_lock(p);
>         mm = p->mm;
>         if (!mm) {
>                 task_unlock(p);
>                 return 0;
>         }
>         points = (get_mm_rss(mm) +
>                   get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
>                  totalram;
>         task_unlock(p);
>
>         list_for_each_entry(child, &p->children, sibling)
>                 /* No lock, child->mm won't be dereferenced */
>                 if (child->mm && child->mm != mm)
>                         forkcount++;
>
>         /* Forkbombs get penalized 10% of available RAM */
>         if (forkcount > 500)
>                 points += 100;

As far as I'm concerned, this is a huge improvement over the current code
(and, incidentally :), quite close to what I originally wanted). I'd be
willing to test it in a few real-world desktop cases if you provide a patch.

> /*
> * /proc/pid/oom_adj ranges from -1000 to +1000 to either
> * completely disable oom killing or always prefer it.
> */
> points += p->signal->oom_adj;

This changes the semantics of oom_adj, but given that I expect the above to make
oom_adj unnecessary on the desktop for the normal cases, I don't really mind.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-04 00:00:33

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, Lubos Lunak wrote:

>
> Given that the badness() proposal I see in another of your mails uses
> get_mm_rss(), I take it that you've meanwhile changed your mind on the VmSize
> vs VmRSS argument and consider that argument irrelevant now.

The argument was never to never factor rss into the heuristic; the
argument was to prevent the loss of oom_adj's functionality, the ability
to define memory leakers from userspace. With my proposal, I believe
the new semantics of oom_adj are even clearer than before and allow users
to either discount or bias a task with a quantity that they are familiar
with: memory.

My rough draft was written in a mail editor, so it's completely untested
and even has a couple of flaws: we need to discount free hugetlb memory
from allowed nodes, we need to intersect the passed nodemask with
current's cpuset, etc.

> I will comment
> only on the suggested use of oom_adj on the desktop here. And actually I hope
> that if something reasonably similar to your badness() proposal replaces the
> current one it will make any use of oom_adj not needed on the desktop in the
> usual case, so this may be irrelevant as well.
>

If you define "on the desktop" performance of the oom killer merely as
protecting a windowing environment, then it should be helpful. I'd still
recommend using OOM_DISABLE for those tasks, though, because I agree that
for users in that environment, KDE getting oom killed is just not a viable
solution.

> > The kernel cannot possibly know what you consider a "vital" process, for
> > that understanding you need to tell it using the very powerful
> > /proc/pid/oom_adj tunable. I suspect if you were to protect all of
> > kdeinit's children by patching it to be OOM_DISABLE so that all threads it
> > forks will inherit that value you'd actually see much improved behavior.
>
> No. Almost everything in KDE is spawned by kdeinit, so everything would get
> the adjustment, which means nothing would in practice get the adjustment.
>

It depends on whether you change the oom_adj of children that you no
longer want to protect which have been forked from kdeinit.

> > I'd also encourage you to talk to the KDE developers to ensure that proper
> > precautions are taken to protect it in such conditions since people who
> > use such desktop environments typically don't want them to be sacrificed
> > for memory.
>
> I am a KDE developer, it's written in my signature. And I've already talked
> enough to the KDE developer who has done the oom_adj code that's already
> there, as that's also me. I don't know kernel internals, but that doesn't
> mean I'm completely clueless about the topic of the discussion I've started.
>

Then I'd recommend that you protect those tasks with OOM_DISABLE,
otherwise they will always be candidates for oom kill; the only way to
explicitly prevent that is by changing oom_adj or moving it to its own
memory controller cgroup. A kernel oom heuristic that is implemented for
a wide variety of platforms, including desktops, servers, and embedded
devices, will never identify KDE as a vital task that cannot possibly be
killed unless you tell the kernel it has that priority. Whether you
choose to use that power or not is up to the KDE team.

> 1) I think you missed that I said that with the current algorithm every KDE
> application can potentially be a contender for selection, and I provided
> numbers to demonstrate that in a selected case. Just because such application
> is not vital does not mean it's good for it to get killed instead of an
> obvious offender.
>

This is exaggerating the point quite a bit; I don't think every single KDE
thread is going to have a badness() score that is higher than all other
system tasks all the time. I think that there are the likely candidates
that you've identified (kdeinit, ksmserver, etc) that are much more prone
to high badness() scores given their total_vm size and the number of
children they fork, but I don't think this is representative of every KDE
thread.

> 2) You probably do not realize the complexity involved in using oom_adj in a
> desktop. Even when doing that manually I would have some difficulty finding
> the right setup for my own desktop use. It'd probably be virtually impossible
> to write code that would do it at least somewhat right with all the widely
> differing desktop setups that change dynamically.
>

Used in combination with /proc/pid/oom_score, it gives you a pretty good
snapshot of how oom killer priorities look at any moment in time. In your
particular use case, however, you seem to be arguing from a perspective of
only protecting certain tasks that you've identified from being oom killed
for desktop environments, namely KDE. For that, there is no confusion to
be had: use OOM_DISABLE. For server environments that I'm also concerned
about, the oom_adj range is much more important to define a killing
priority when used in combination with cpusets.

> 3) oom_adj is ultimately just a kludge to handle special cases where the
> heuristic doesn't get it right for whatever strange reason. But even you
> yourself in another mail presented a heuristic that I believe would make any
> use of oom_adj on the desktop unnecessary in the usual cases. The usual
> desktop is not a special case.
>

The kernel will _always_ need user input into which tasks it believes to
be vital. For you, that's KDE. For me, that's one of our job schedulers.

> > The heuristics are always well debated in this forum and there's little
> > chance that we'll ever settle on a single formula that works for all
> > possible use cases. That makes oom_adj even more vital to the overall
> > efficiency of the oom killer, I really hope you start to use it to your
> > advantage.
>
> I really hope your latest badness() heuristics proposal allows us to dump
> even the oom_adj use we already have.
>

For your environment, I hope the same. In production servers we'll still
need the ability to tune /proc/pid/oom_adj to define memory leakers and
tasks using far more memory than expected, so perhaps my rough draft can
be a launching pad into a positive discussion about the future of the
heuristic based on consensus and input from all impacted parties.

2010-02-04 00:05:20

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, Lubos Lunak wrote:

> > unsigned int badness(struct task_struct *p,
> >                      unsigned long totalram)
> > {
> >         struct task_struct *child;
> >         struct mm_struct *mm;
> >         int forkcount = 0;
> >         long points;
> >
> >         task_lock(p);
> >         mm = p->mm;
> >         if (!mm) {
> >                 task_unlock(p);
> >                 return 0;
> >         }
> >         points = (get_mm_rss(mm) +
> >                   get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
> >                  totalram;
> >         task_unlock(p);
> >
> >         list_for_each_entry(child, &p->children, sibling)
> >                 /* No lock, child->mm won't be dereferenced */
> >                 if (child->mm && child->mm != mm)
> >                         forkcount++;
> >
> >         /* Forkbombs get penalized 10% of available RAM */
> >         if (forkcount > 500)
> >                 points += 100;
>
> As far as I'm concerned, this is a huge improvement over the current code
> (and, incidentally :), quite close to what I originally wanted). I'd be
> willing to test it in a few real-world desktop cases if you provide a patch.
>

There're some things that still need to be worked out, like discounting
hugetlb pages on each allowed node, respecting current's cpuset mems,
etc., but I think it gives us a good rough draft of where we might end up.
I did use the get_mm_rss() that you suggested, but I think it's more
helpful in the context of a fraction of total memory allowed so the other
heuristics (forkbomb, root tasks, nice'd tasks, etc.) are penalizing the
points in a known quantity rather than a manipulation of that baseline.

Do you have any comments about the forkbomb detector or its threshold that
I've put in my heuristic? I think detecting these scenarios is still an
important issue that we need to address instead of simply removing it from
consideration entirely.

2010-02-04 00:18:31

by Rik van Riel

[permalink] [raw]
Subject: Re: Improving OOM killer

On 02/03/2010 07:05 PM, David Rientjes wrote:
> On Wed, 3 Feb 2010, Lubos Lunak wrote:

>>> /* Forkbombs get penalized 10% of available RAM */
>>> if (forkcount > 500)
>>>         points += 100;

> Do you have any comments about the forkbomb detector or its threshold that
> I've put in my heuristic? I think detecting these scenarios is still an
> important issue that we need to address instead of simply removing it from
> consideration entirely.

I believe that malicious users are best addressed in person,
or preemptively through cgroups and rlimits.

Having a process with over 500 children is quite possible
with things like apache, Oracle, postgres and other forking
daemons.

Killing the parent process can result in the service
becoming unavailable, and in some cases even data
corruption.

--
All rights reversed.

2010-02-04 07:58:40

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thursday 04 of February 2010, David Rientjes wrote:
> On Wed, 3 Feb 2010, Lubos Lunak wrote:
> > As far as I'm concerned, this is a huge improvement over the current
> > code (and, incidentally :), quite close to what I originally wanted). I'd
> > be willing to test it in a few real-world desktop cases if you provide a
> > patch.
>
> There're some things that still need to be worked out,

Ok. Just please do not let the perfect stand in the way of the good for way
too long.

> Do you have any comments about the forkbomb detector or its threshold that
> I've put in my heuristic? I think detecting these scenarios is still an
> important issue that we need to address instead of simply removing it from
> consideration entirely.

I think before finding the answer we should first figure out what the
question is :). Beyond the vague "forkbomb" description I still don't know
what realistic scenarios this is supposed to handle. IMO trying to cover
intentional abuse is a losing battle, so I think the purpose of this code
should be just to handle cases where a mistake leads to relatively fast
spawning of children of a specific parent that will lead to OOM. The shape
of the children subtree doesn't matter; it can be either a parent with many
direct children, or children created recursively - I think any case is
possible here. A realistic example would be typing 'make -j22' by mistake
instead of 'make -j2' and overloading the machine with too many g++
instances. That would actually be a non-trivial tree of children, with
recursive make and sh processes in it.

A good way to detect this would be checking in badness() if the process has
any children with relatively low CPU and real time values (let's say
something less than a minute). If yes, the badness score should also account
for all these children, recursively. I'm not sure about the exact formula;
just summing up the memory usage like it is done now does not fit your
0-1000 score idea, and it's also wrong because it doesn't take sharing of
memory into consideration (e.g. a KDE app with several kdelibs-based
children could achieve a massive score here because of extensive sharing,
even though the actual memory usage increase caused by them could be
insignificant). I don't know kernel internals, so I don't know how feasible
it would be, but one naive idea would be to simply count how big a portion
of the total memory all these considered processes occupy.
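
A hedged sketch of that check (untested, kernel-style pseudocode with
approximate 2.6.3x field names; the recursive walk is omitted for brevity):
scan the direct children, collect the resident memory of the "young" ones,
and let the caller scale the result against total RAM:

static unsigned long young_children_rss(struct task_struct *p,
                                        unsigned long uptime_secs)
{
        struct task_struct *child;
        unsigned long rss = 0;

        list_for_each_entry(child, &p->children, sibling) {
                struct mm_struct *mm;

                task_lock(child);
                mm = child->mm;
                /* "young": started less than a minute ago */
                if (mm && mm != p->mm &&
                    uptime_secs - child->start_time.tv_sec < 60)
                        rss += get_mm_rss(mm);
                task_unlock(child);
        }
        return rss;     /* caller scales this into the 0-1000 range */
}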

This indeed would not handle the case where a tree of processes slowly
leaks - for example, a bug in Apache making all the forked children of
the master process leak memory equally, with none of the individual
children leaking enough to score more than a single unrelated innocent
process. Here I question how realistic such a scenario actually would be,
and mainly the actual possibility of detecting such a case. I do not see
how code could distinguish
this from the case of using Konsole or XTerm to launch a number of unrelated
KDE/X applications each of which would occupy a considerable amount of
memory. Here clearly killing the Konsole/XTerm and all the spawned
applications with it is incorrect, so with no obvious offender the OOM killer
would simply have to pick something. And since you now probably feel the urge
to point out oom_adj again, I want to point out again that it's not a very
good solution for the desktop and that Konsole/XTerm should not have such
protection, unless the user explicitly does it themselves - e.g. Konsole can
be set to infinite scrollback, so when accidentally running something that
produces a huge amount of output Konsole actually could be the only right
process to kill. So I think the case of slowly leaking group of children
cannot be reasonably solved in code.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-04 09:50:56

by Jiri Kosina

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, David Rientjes wrote:

> > > /* Forkbombs get penalized 10% of available RAM */
> > > if (forkcount > 500)
> > >         points += 100;
> >
> > As far as I'm concerned, this is a huge improvement over the current code
> > (and, incidentally :), quite close to what I originally wanted). I'd be
> > willing to test it in a few real-world desktop cases if you provide a patch.
[ ... ]
> Do you have any comments about the forkbomb detector or its threshold that
> I've put in my heuristic? I think detecting these scenarios is still an
> important issue that we need to address instead of simply removing it from
> consideration entirely.

Why does the OOM killer care about forkbombs *at all*?

If we really want the kernel to detect forkbombs (*), we'd have to establish
completely separate infrastructure for that (with its own knobs for tuning
and the possibility of disabling it completely).

The task of the OOM killer is to find the process that caused the system
to run out of memory, and wipe it; it's as simple as that.

The connection to forkbombs seems so loose that I don't see it.

(*) How is a forkbomb even defined? Where does the magic constant in
'forkcount > 500' come from? If your aim is to penalize server processes
on very loaded web/database servers, then this is probably the correct
approach. Otherwise, I don't seem to see the point.

Thanks,

--
Jiri Kosina
SUSE Labs, Novell Inc.

2010-02-04 21:34:48

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thu, 4 Feb 2010, Lubos Lunak wrote:

> > There're some things that still need to be worked out,
>
> Ok. Just please do not let the perfect stand in the way of the good for way
> too long.
>

The changes to the heuristic will be a part of a larger patchset that
basically rewrites the oom killer since there have actually been many
other suggestions and asides that have been mentioned during the course of
the discussion: mempolicy targeted oom killing, killing a child only with
the highest badness score first, always ensuring the killed task shares a
set of allowed nodes with current, preempting the oom killer for GFP_DMA
allocations, etc. I'll write that patchset within the next couple of
days, but we're still talking about 2.6.35 material at the earliest. If
you could test it with your particular usecase we can make sure to work
any heuristic problems out in the early stages.

> > Do you have any comments about the forkbomb detector or its threshold that
> > I've put in my heuristic? I think detecting these scenarios is still an
> > important issue that we need to address instead of simply removing it from
> > consideration entirely.
>
> I think before finding out the answer it should be first figured out what the
> question is :). Besides the vague "forkbomb" description I still don't know
> what realistic scenarios this is supposed to handle. IMO trying to cover
> intentional abuse is a lost fight, so I think the purpose of this code should
> be just to handle cases when there's a mistake leading to relatively fast
> spawning of children of a specific parent that'll lead to OOM.

Yes, forkbombs are not always malicious, they can be the result of buggy
code and there's no other kernel mechanism that will hold them off so that
the machine is still usable. If a task forks and execve's thousands of
threads on your 2GB desktop machine, either because it's malicious, it's a
bug, or the user made a mistake, that's going to be detrimental,
especially to your interactivity, depending on the nature of what was
executed :) Keep in mind that the forking parent, such as a job
scheduler or terminal, and all of its individual children may have very
small rss and swap statistics, even though cumulatively it's a problem.
In that scenario, KDE is once again going to become the target on your
machine when in reality we should target the forkbomb.

How we target the forkbomb could be another topic for discussion. Recall
that the oom killer does not necessarily always kill the task with the
highest badness() score; it will always attempt to kill one of its
children with a separate mm first. That doesn't help us as much as it
could in the forkbomb scenario since the parent could likely continue to
fork; the end result you'll see is difficulty in getting to a command line
where you can find and kill the parent yourself. Thus, we need to preempt
the preference to kill the child first when we've detected a forkbomb task
and it is selected for oom kill (detecting a forkbomb is only a
penalization in the heuristic, it doesn't mean automatic killing, so
another task consuming much more memory could be chosen instead).

> The shape of
> the children subtree doesn't matter, it can be either a parent with many
> direct children, or children being created recursively, I think any case is
> possible here. A realistic example would be e.g. by mistake
> typing 'make -j22' instead of 'make -j2' and overloading the machine by too
> many g++ instances. That would be actually a non-trivial tree of children,
> with recursive make and sh processes in it.
>

We can't address recursive forkbombing in the oom killer with any
efficiency, but luckily those cases aren't very common.

> A good way to detect this would be checking in badness() if the process has
> any children with relatively low CPU and real time values (let's say
> something less than a minute). If yes, the badness score should also account
> for all these children, recursively. I'm not sure about the exact formula,
> just summing up the memory usage like it is done now does not fit your 0-1000
> score idea, and it's also wrong because it doesn't take sharing of memory
> into consideration (e.g. a KDE app with several kdelibs-based children could
> achieve a massive score here because of extensive sharing, even though the
> actual memory usage increase caused by them could be insignificant). I don't
> know kernel internals, so I don't know how feasible it would be, but one
> naive idea would be to simply count how big a portion of the total memory all
> these considered processes occupy.
>

badness() can't be called recursively on children, so we need to look for
simple metrics like the ones you mentioned: cpu time, for example. I think we
should also look at uid.

> This indeed would not handle the case when a tree of processes would slowly
> leak, for example there being a bug in Apache and all the forked children of
> the master process leaking memory equally, but none of the single children
> leaking enough to score more than a single unrelated innocent process. Here I
> question how realistic such a scenario actually would be, and mainly whether
> such a case could even be detected. I do not see how code could distinguish
> this from the case of using Konsole or XTerm to launch a number of unrelated
> KDE/X applications each of which would occupy a considerable amount of
> memory. Here clearly killing the Konsole/XTerm and all the spawned
> applications with it is incorrect, so with no obvious offender the OOM killer
> would simply have to pick something.

The memory consumption of these children was not considered in my rough
draft, it was simply a counter of how many first-generation children each
task has. When Konsole forks a very large number of tasks, is it
unreasonable to bias it with a penalty of perhaps 10% of RAM? Or should
we take the lowest rss size of those children, multiply it by the number
of children, and penalize it after it reaches a certain threshold (500?
1000?) as the "cost of running the parent"? This heavily biases parents
that have forked a very large number of small threads, so that they would
be killed directly, whereas the large, memory-hog children are killed
themselves based on their own badness() heuristic.

2010-02-04 21:39:17

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thu, 4 Feb 2010, Jiri Kosina wrote:

> Why does OOM killer care about forkbombs *at all*?
>

Because the cumulative effects of a forkbomb are detrimental to the
system and the badness() heuristic favors large memory consumers very
heavily. Thus, the forkbomb is never really a strong candidate for oom
kill since the parent may consume very little memory itself and meanwhile
KDE or another large memory consumer will get innocently killed instead as
a result.

> If we really want kernel to detect forkbombs (*), we'd have to establish
> completely separate infrastructure for that (with its own knobs for tuning
> and possibilities of disabling it completely).
>

That's what we're trying to do, we can look at the sheer number of
children that the parent has forked and check for it to be over a certain
"forkbombing threshold" (which, yes, can be tuned from userspace), the
uptime of those children, their resident set size, etc., to attempt to
find a sane heuristic that penalizes them.

2010-02-04 21:49:10

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 3 Feb 2010, Rik van Riel wrote:

> > Do you have any comments about the forkbomb detector or its threshold that
> > I've put in my heuristic? I think detecting these scenarios is still an
> > important issue that we need to address instead of simply removing it from
> > consideration entirely.
>
> I believe that malicious users are best addressed in person,
> or preemptively through cgroups and rlimits.
>

Forkbombs need not be the result of malicious users.

> Having a process with over 500 children is quite possible
> with things like apache, Oracle, postgres and other forking
> daemons.
>

It's clear that the forkbomb threshold would need to be definable from
userspace and probably default to something high such as 1000.

Keep in mind that we're in the oom killer here, though. So we're out of
memory and we need to kill something; should Apache, Oracle, and postgres
not be penalized for their cost of running by factoring in something like
this?

(lowest rss size of children) * (# of first-generation children) /
(forkbomb threshold)
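
As a standalone sketch of that arithmetic (function and parameter names
invented for illustration, not kernel code):

static unsigned long forkbomb_penalty(unsigned long lowest_child_rss,
                                      unsigned long nr_children,
                                      unsigned long forkbomb_thres)
{
        /* No penalty until the child count crosses the threshold. */
        if (nr_children <= forkbomb_thres)
                return 0;
        /* Scale the smallest child's rss by how far over it we are. */
        return lowest_child_rss * nr_children / forkbomb_thres;
}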

> Killing the parent process can result in the service
> becoming unavailable, and in some cases even data
> corruption.
>

There's only one possible remedy for that, which is OOM_DISABLE; the oom
killer cannot possibly predict data corruption as the result of killing a
process, and this is no different. Everything besides init, kthreads,
OOM_DISABLE threads, and threads that do not share the same cpuset, memcg,
or set of allowed mempolicy nodes is a candidate for oom kill.

2010-02-04 22:07:40

by Rik van Riel

[permalink] [raw]
Subject: Re: Improving OOM killer

On 02/04/2010 04:48 PM, David Rientjes wrote:

> Keep in mind that we're in the oom killer here, though. So we're out of
> memory and we need to kill something; should Apache, Oracle, and postgres
> not be penalized for their cost of running by factoring in something like
> this?

No, they should not.

The goal of the OOM killer is to kill some process, so the
system can continue running and automatically become available
again for whatever workload the system was running.

Killing the parent process of one of the system daemons does
not achieve that goal, because you now caused a service to no
longer be available.

2010-02-04 22:14:59

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thu, 4 Feb 2010, Rik van Riel wrote:

> > Keep in mind that we're in the oom killer here, though. So we're out of
> > memory and we need to kill something; should Apache, Oracle, and postgres
> > not be penalized for their cost of running by factoring in something like
> > this?
>
> No, they should not.
>
> The goal of the OOM killer is to kill some process, so the
> system can continue running and automatically become available
> again for whatever workload the system was running.
>
> Killing the parent process of one of the system daemons does
> not achieve that goal, because you now caused a service to no
> longer be available.
>

The system daemon wouldn't be killed, though. You're right that this
heuristic would prefer the system daemon slightly more as a result of the
forkbomb penalty, but the oom killer always attempts to sacrifice a child
with a separate mm before killing the selected task. Since the forkbomb
heuristic only adds up those children with separate mms, we're guaranteed
to not kill the daemon itself.
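
Schematically, that preference looks like this (simplified; kill_task() is
an invented stand-in for the actual kill path):

static void oom_kill_selected(struct task_struct *p)
{
        struct task_struct *child;

        list_for_each_entry(child, &p->children, sibling) {
                if (!child->mm || child->mm == p->mm)
                        continue;       /* no mm, or shares the parent's */
                kill_task(child);       /* sacrifice the child instead */
                return;
        }
        kill_task(p);                   /* no suitable child, kill victim */
}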

2010-02-04 22:31:44

by Frans Pop

[permalink] [raw]
Subject: Re: Improving OOM killer

David Rientjes wrote:
> It's clear that the forkbomb threshold would need to be definable from
> userspace and probably default to something high such as 1000.
>
> Keep in mind that we're in the oom killer here, though. So we're out of
> memory and we need to kill something; should Apache, Oracle, and postgres
> not be penalized for their cost of running by factoring in something like
> this?
>
> (lowest rss size of children) * (# of first-generation children) /
> (forkbomb threshold)

Shouldn't fork bomb detection take into account the age of children?
After all, long running processes with a lot of long running children are
rather unlikely to be runaway fork _bombs_.

Children of desktop environments are more likely to be long running than
e.g. those of a server process that's being DoSed.
The goal of the OOM killer is, IIUC, to identify the process that's
causing the immediate problem, so in this example it should prefer the latter.

Cheers,
FJP

2010-02-04 22:53:42

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thu, 4 Feb 2010, Frans Pop wrote:

> Shouldn't fork bomb detection take into account the age of children?
> After all, long running processes with a lot of long running children are
> rather unlikely to be runaway fork _bombs_.
>

Yeah, Lubos mentioned using cpu time as a requirement, in addition to the
already existing child->mm != parent->mm, as a prerequisite to be added
into the tally to check the forkbomb threshold. I think something like
this would be appropriate:

struct task_cputime task_time;
int forkcount = 0;
int child_rss = 0;

...

list_for_each_entry(child, &p->children, sibling) {
        unsigned long runtime;

        task_lock(child);
        if (!child->mm || child->mm == p->mm) {
                task_unlock(child);
                continue;
        }
        thread_group_cputime(child, &task_time);
        runtime = cputime_to_jiffies(task_time.utime) +
                  cputime_to_jiffies(task_time.stime);

        /*
         * Only threads that have run for less than a second are
         * considered toward the forkbomb, these threads rarely
         * get to execute at all in such cases anyway.
         */
        if (runtime < HZ) {
                child_rss += get_mm_rss(child->mm);
                forkcount++;
        }
        task_unlock(child);
}

if (forkcount > sysctl_oom_forkbomb_thres) {
        /*
         * Penalize forkbombs by considering the average rss and
         * how many factors we are over the threshold.
         */
        points += child_rss / sysctl_oom_forkbomb_thres;
}

I changed the calculation from lowest child rss to average child rss, so
this is functionally equivalent to

(average rss size of children) * (# of first-generation execve children) /
sysctl_oom_forkbomb_thres
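
To put example numbers on it (my own illustration, not from a patch): with
sysctl_oom_forkbomb_thres at the suggested default of 1000, a parent with
3000 qualifying children averaging 2000 pages of rss each accumulates a
child_rss of 6,000,000 pages, so the penalty is 6,000,000 / 1000 = 6000
points: the average rss of 2000 multiplied by the factor of 3 by which the
threshold was exceeded.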

2010-02-07 22:58:07

by Oliver Neukum

[permalink] [raw]
Subject: Re: Improving OOM killer

Am Donnerstag, 4. Februar 2010 22:39:08 schrieb David Rientjes:
> > If we really want kernel to detect forkbombs (*), we'd have to establish
> > completely separate infrastructure for that (with its own knobs for tuning
> > and possibilities of disabling it completely).
> >
>
> That's what we're trying to do, we can look at the sheer number of
> children that the parent has forked and check for it to be over a certain
> "forkbombing threshold" (which, yes, can be tuned from userspace), the
> uptime of those children, their resident set size, etc., to attempt to
> find a sane heuristic that penalizes them.

Wouldn't it be saner to have a selection by user, so that users that
are over the overcommit limit are targeted?

Regards
Oliver

2010-02-10 03:10:20

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Fri, 5 Feb 2010, Oliver Neukum wrote:

> > That's what we're trying to do, we can look at the sheer number of
> > children that the parent has forked and check for it to be over a certain
> > "forkbombing threshold" (which, yes, can be tuned from userspace), the
> > uptime of those children, their resident set size, etc., to attempt to
> > find a sane heuristic that penalizes them.
>
> Wouldn't it be saner to have a selection by user, so that users that
> are over the overcommit limit are targeted?
>

It's rather unnecessary for the forkbomb case because then it would
unfairly penalize any user that runs lots of tasks. The forkbomb handling
code is really to prevent either user error, bugs in the application, or
maliciousness. The goal isn't necessarily to kill anything that forks an
egregious amount of tasks (which is why we always prefer a child with a
seperate address space than a parent, anyway) but rather to try to make
sure the system is still usable and recoverable.

2010-02-10 20:54:48

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thursday 04 of February 2010, David Rientjes wrote:
> On Thu, 4 Feb 2010, Lubos Lunak wrote:
> > I think before finding out the answer it should be first figured out
> > what the question is :). Besides the vague "forkbomb" description I still
> > don't know what realistic scenarios this is supposed to handle. IMO
> > trying to cover intentional abuse is a lost fight, so I think the purpose
> > of this code should be just to handle cases when there's a mistake
> > leading to relatively fast spawning of children of a specific parent
> > that'll lead to OOM.
>
> Yes, forkbombs are not always malicious, they can be the result of buggy
> code and there's no other kernel mechanism that will hold them off so that
> the machine is still usable. If a task forks and execve's thousands of
> threads on your 2GB desktop machine, either because it's malicious, it's a
> bug, or the user made a mistake, that's going to be detrimental,
> especially to your interactivity, depending on the nature of what was
> executed :) Keep in mind that the forking parent, such as a job
> scheduler or terminal, and all of its individual children may have very
> small rss and swap statistics, even though cumulatively it's a problem.

Which is why I suggested summing up the memory of the parent and its
children.

> > The shape of
> > the children subtree doesn't matter, it can be either a parent with many
> > direct children, or children being created recursively, I think any case
> > is possible here. A realistic example would be e.g. by mistake
> > typing 'make -j22' instead of 'make -j2' and overloading the machine by
> > too many g++ instances. That would be actually a non-trivial tree of
> > children, with recursive make and sh processes in it.
>
> We can't address recursive forkbombing in the oom killer with any
> efficiency, but luckily those cases aren't very common.

Right, I've never run a recursive make that brought my machine to its knees.
Oh, wait.

And why exactly is iterating over 1st level children efficient enough and
doing that recursively is not? I don't find it significantly more expensive
and badness() is hardly a bottleneck anyway.

> > A good way to detect this would be checking in badness() if the process
> > has any children with relatively low CPU and real time values (let's say
> > something less than a minute). If yes, the badness score should also
> > account for all these children, recursively. I'm not sure about the exact
> > formula, just summing up the memory usage like it is done now does not
> > fit your 0-1000 score idea, and it's also wrong because it doesn't take
> > sharing of memory into consideration (e.g. a KDE app with several
> > kdelibs-based children could achieve a massive score here because of
> > extensive sharing, even though the actual memory usage increase caused by
> > them could be insignificant). I don't know kernel internals, so I don't
> > know how feasible it would be, but one naive idea would be to simply
> count how big a portion of the total memory all these considered processes
> > occupy.
>
> badness() can't be called recursively on children, so we need to look for
> simple metrics like the ones you mentioned: cpu time, for example.

I didn't mean calling badness() recursively, just computing total memory
usage of the parent and all the children together (and trying not to count
shared parts more than once). If doing that accurately is too expensive,
summing rss+swap for the parent and using only unshared memory for the
children seems reasonably close to it and pretty cheap (surely if the code
can find out rss+swap for each child it can equally easily find out how much
of it is unshared?). On the assumption that shared memory will hopefully be
reasonably accounted for in the parent's memory usage, and thus only the
unshared portions of the children are needed, this appears to be much more
precise than summing up rss for all of them like your proposal does.
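
In rough pseudocode (unshared_rss()/unshared_swap() standing for whatever
accessors the kernel could provide):

int tree_cost(task)
{
    // Count the parent's rss+swap once in full; for each child add only
    // memory that is not shared, so shared pages are not counted twice.
    total = rss(task) + swap(task);
    foreach_child(child, task)
        total += unshared_rss(child) + unshared_swap(child);
    return total;
}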

> > This indeed would not handle the case when a tree of processes would
> > slowly leak, for example there being a bug in Apache and all the forked
> > children of the master process leaking memory equally, but none of the
> > single children leaking enough to score more than a single unrelated
> innocent process. Here I question how realistic such a scenario actually
> would be, and mainly whether such a case could even be detected. I do
> > not see how code could distinguish this from the case of using Konsole or
> > XTerm to launch a number of unrelated KDE/X applications each of which
> > would occupy a considerable amount of memory. Here clearly killing the
> > Konsole/XTerm and all the spawned applications with it is incorrect, so
> > with no obvious offender the OOM killer would simply have to pick
> > something.
>
> The memory consumption of these children was not considered in my rough
> draft, it was simply a counter of how many first-generation children each
> task has.

Why exactly do you think only 1st generation children matter? Look again at
the process tree I posted and you'll see it solves nothing there. I still
fail to see why counting also all other generations should be considered
anything more than a negligible penalty for something that's not a bottleneck
at all.

> When Konsole forks a very large number of tasks, is it
> unreasonable to bias it with a penalty of perhaps 10% of RAM?

It appears unreasonable to penalize it with a random magic number.

> Or should
> we take the lowest rss size of those children, multiply it by the number
> of children, and penalize it after it reaches a certain threshold (500?
> 1000?) as the "cost of running the parent"?

And another magic number. And again it doesn't solve my initial problem,
where the number of processes taking part in the OOM situation is counted
only in the tens.

Simply computing the cost of the whole children subtree (or a reasonable
approximation) avoids the need for any magic numbers and gives a much better
representation of how costly the subtree is, since, well, it is the cost
itself.

> This heavily biases parents
> that have forked a very large number of small threads, so that they would
> be killed directly, whereas the large, memory-hog children are killed
> themselves based on their own badness() heuristic.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-10 20:54:46

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thursday 04 of February 2010, David Rientjes wrote:
> On Thu, 4 Feb 2010, Rik van Riel wrote:
> > The goal of the OOM killer is to kill some process, so the
> > system can continue running and automatically become available
> > again for whatever workload the system was running.
> >
> > Killing the parent process of one of the system daemons does
> > not achieve that goal, because you now caused a service to no
> > longer be available.
>
> The system daemon wouldn't be killed, though. You're right that this
> heuristic would prefer the system daemon slightly more as a result of the
> forkbomb penalty, but the oom killer always attempts to sacrifice a child
> with a separate mm before killing the selected task. Since the forkbomb
> heuristic only adds up those children with separate mms, we're guaranteed
> to not kill the daemon itself.

Which however can mean that not killing this system daemon will be traded for
DoS-ing the whole system, if the daemon keeps spawning new children as soon
as the OOM killer frees up resources for them.

This looks like the wrong solution to me; it's like trying to save a target by
shooting down all incoming bombs instead of shooting the bomber. If the OOM
situation is caused by one or a limited number of its children, or if the
system daemon is not responsible for the forkbomb (e.g. it's only a subtree of
its children), then it won't be selected for killing anyway. If it is
responsible for the forkbomb, the OOM killer can keep killing the bombs
forever to no avail.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-10 21:09:49

by Rik van Riel

[permalink] [raw]
Subject: Re: Improving OOM killer

On 02/10/2010 03:54 PM, Lubos Lunak wrote:

> Simply computing the cost of the whole children subtree (or a reasonable
> approximation) avoids the need for any magic numbers and gives a much better
> representation of how costly the subtree is, since, well, it is the cost
> itself.

That assumes you want to kill off that entire tree.

You will not want to do that when a web server or
database server runs out of memory, because the
goal of the OOM killer is to allow the system to
continue to run and be useful. This means keeping
the services available...

--
All rights reversed.

2010-02-10 21:11:14

by Rik van Riel

[permalink] [raw]
Subject: Re: Improving OOM killer

On 02/10/2010 03:54 PM, Lubos Lunak wrote:

> Which however can mean that not killing this system daemon will be traded for
> DoS-ing the whole system, if the daemon keeps spawning new children as soon
> as the OOM killer frees up resources for them.

Killing the system daemon *is* a DoS.

It would stop eg. the database or the web server, which is
generally the main task of systems that run a database or
a web server.

--
All rights reversed.

2010-02-10 21:29:28

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wednesday 10 of February 2010, Rik van Riel wrote:
> On 02/10/2010 03:54 PM, Lubos Lunak wrote:
> > Which however can mean that not killing this system daemon will be
> > traded for DoS-ing the whole system, if the daemon keeps spawning new
> > children as soon as the OOM killer frees up resources for them.
>
> Killing the system daemon *is* a DoS.

Maybe, but if there are two such system daemons on the machine, it's only
half of the other DoS. And since that system daemon has already been
identified as a forkbomb, it's probably already useless anyway and killing
the children won't save anything. In which realistic case does a system daemon
have children that together cause OOM, yet can still be considered working after
you kill one or a limited number of those children?

> It would stop eg. the database or the web server, which is
> generally the main task of systems that run a database or
> a web server.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-10 21:34:44

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wednesday 10 of February 2010, Rik van Riel wrote:
> On 02/10/2010 03:54 PM, Lubos Lunak wrote:
> > Simply computing the cost of the whole children subtree (or a
> > reasonable approximation) avoids the need for any magic numbers and gives
> > a much better representation of how costly the subtree is, since, well,
> > it is the cost itself.
>
> That assumes you want to kill off that entire tree.

As said in another mail, I think I actually do, since the entire tree is
identified as the problem. But regardless of that, surely computing the cost
of a forkbomb by computing something that is close to the actual cost of it
is better than trying magic numbers?

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-10 22:20:19

by Alan

[permalink] [raw]
Subject: Re: Improving OOM killer

> Killing the system daemon *is* a DoS.
>
> It would stop eg. the database or the web server, which is
> generally the main task of systems that run a database or
> a web server.

One of the problems with picking on tasks that fork a lot is that
describes apache perfectly. So a high loaded apache will get shot over a
rapid memory eating cgi script.

Any heuristic is going to be iffy - but that isn't IMHO a good one to
work from. If anything "who allocated lots of RAM recently" may be a
better guide but we don't keep stats for that.

Alan

2010-02-10 22:25:23

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 10 Feb 2010, Lubos Lunak wrote:

> > Yes, forkbombs are not always malicious, they can be the result of buggy
> > code and there's no other kernel mechanism that will hold them off so that
> > the machine is still usable. If a task forks and execve's thousands of
> > threads on your 2GB desktop machine, either because it's malicious, it's a
> > bug, or the user made a mistake, that's going to be detrimental,
> > especially to your interactivity, depending on the nature of what was
> > executed :) Keep in mind that the forking parent, such as a job
> > scheduler or terminal, and all of its individual children may have very
> > small rss and swap statistics, even though cumulatively it's a problem.
>
> Which is why I suggested summing up the memory of the parent and its
> children.
>

That's almost identical to the current heuristic where we sum half of
each child's VM size; unfortunately it's not a good indicator of
forkbombs since in your particular example it would be detrimental to
kdeinit. My heuristic considers runtime of the children as an indicator
of a forkbombing parent since such tasks don't typically get to run
anyway. The rss or swap usage of a child with a separate address space
simply isn't relevant to the badness score of the parent; it unfairly
penalizes medium/large server jobs.

> > We can't address recursive forkbombing in the oom killer with any
> > efficiency, but luckily those cases aren't very common.
>
> Right, I've never run a recursive make that brought my machine to its knees.
> Oh, wait.
>

That's completely outside the scope of the oom killer, though: it is _not_
the oom killer's responsibility to enforce a kernel-wide forkbomb
policy, which would be much better handled at execve() time.

It's a very small part of my badness heuristic, depending on the average
size of the children's rss and swap usage, because we want to slightly
penalize tasks that fork an extremely large number of tasks that have no
substantial runtime; memory is being consumed but very little work is
getting done by those thousand children. This would more often than not
be used only to break ties when two parents have similar memory
consumption themselves but one is obviously oversubscribing the system.

> And why exactly is iterating over 1st level children efficient enough and
> doing that recursively is not? I don't find it significantly more expensive
> and badness() is hardly a bottleneck anyway.
>

If we look at children's memory usage recursively, then we'll always end
up selecting init_task.

> > The memory consumption of these children was not considered in my rough
> > draft, it was simply a counter of how many first-generation children each
> > task has.
>
> Why exactly do you think only 1st generation children matter? Look again at
> the process tree posted by me and you'll see it solves nothing there. I still
> fail to see why counting also all other generations should be considered
> anything more than a negligible penalty for something that's not a bottleneck
> at all.
>

You're specifying a problem that is outside the scope of the oom killer,
sorry.

2010-02-10 22:31:57

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wed, 10 Feb 2010, Alan Cox wrote:

> One of the problems with picking on tasks that fork a lot is that
> describes apache perfectly. So a high loaded apache will get shot over a
> rapid memory eating cgi script.
>

With my rewrite, the oom killer would not select apache but rather the
child with a separate address space that is consuming the largest amount of
allowed memory, and only when a configurable number of such children (1000
by default) have not had any runtime. My heuristic is only meant to
slightly penalize such tasks so that, for oom kill, they can be distinguished
from other parents with comparable memory usage. Enforcing a strict
forkbomb policy is out of the scope of the oom killer, though, so no
attempt was made.

> Any heuristic is going to be iffy - but that isn't IMHO a good one to
> work from. If anything "who allocated lots of RAM recently" may be a
> better guide but we don't keep stats for that.
>

That's what my heuristic basically does, if a parent is identified as a
forkbomb, then it is only penalized by averaging the memory consumption of
those children and then multiplying it by the number of times the
configurable forkbomb threshold was reached.

2010-02-11 09:50:44

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wednesday 10 of February 2010, Alan Cox wrote:
> > Killing the system daemon *is* a DoS.
> >
> > It would stop eg. the database or the web server, which is
> > generally the main task of systems that run a database or
> > a web server.
>
> One of the problems with picking on tasks that fork a lot is that
> describes apache perfectly. So a high loaded apache will get shot over a
> rapid memory eating cgi script.

It will not. If it's only a single cgi script, then that child should be
selected by badness(), not the parent.

I personally consider the logic of trying to find the offender using
badness() and then killing its child instead to be flawed. Already badness()
itself should select what to kill and that should be killed. If it's a single
process that is the offender, it should be killed. If badness() decides it is
a whole subtree responsible for the situation, then the top of it needs to be
killed, otherwise the reason for the problem will remain.

I expect the current logic of trying to kill children first is based on the
system daemon logic, but if e.g. the Apache master process itself causes OOM,
then the kernel has no way to find out if it's an important process
that should be protected or if it's some random process causing a forkbomb.
From the kernel's point of view, if the Apache master process caused the
problem, then the problem should be solved there. If the reason for the
problem was actually e.g. a temporary high load on the server, then Apache is
probably misconfigured, and if it really should stay running no matter what,
then I guess that's the case to use oom_adj for. But otherwise, from the OOM
killer's point of view, that is where the problem was.

Of course, the algorithm used in badness() should be careful not to propagate
the excessive memory usage in that case to the innocent parent. This problem
existed in the current code until it was fixed by the "/2" recently, and at
least my current proposal actually suffers from it too. But I envision
something like this could handle it nicely (pseudocode):

int oom_children_memory_usage(task)
{
    // Memory shared with the parent should not be counted again.
    // Since it's expensive to find that out exactly, just assume
    // that the amount of shared memory that is not shared with the parent
    // is insignificant.
    total = unshared_rss(task) + unshared_swap(task);
    foreach_child(child, task)
        total += oom_children_memory_usage(child);
    return total;
}

int badness(task)
{
    int total_memory = 0;
    ...
    int max_child_memory = 0;   // the most memory used by any child
    int max_child_memory_2 = 0; // the 2nd most memory used by a child
    foreach_child(child, task)
    {
        if (sharing_the_same_memory(child, task))
            continue;
        if (real_time(child) > 1minute)
            continue; // running long, not a forkbomb
        int memory = oom_children_memory_usage(child);
        total_memory += memory;
        if (memory > max_child_memory)
        {
            max_child_memory_2 = max_child_memory;
            max_child_memory = memory;
        }
        else if (memory > max_child_memory_2)
            max_child_memory_2 = memory;
    }
    if (max_child_memory_2 != 0) // there were at least two children
    {
        if (max_child_memory > total_memory / 2)
        {
            // A single child contributes the majority of memory used by
            // all the children. Do not add it to the total, so that if that
            // process is the biggest offender, the killer picks it instead
            // of this parent.
            total_memory -= max_child_memory;
        }
    }
    ...
}

The logic is simply that a process is responsible for its children only if
their cost is similar. If one of them stands out, it is responsible for
itself and the parent is not. This check is intentionally not repeated
recursively in oom_children_memory_usage(), to also cover the case when e.g.
parallel make runs too many processes wrapped by shell; in that case making any
of those
shell instances responsible for its child doesn't help anything, but making
make responsible for all of them helps.

Alternatively, if somebody has a good use case where first going after a
child may make sense, then it perhaps would help to
add 'oom_recently_killed_children' to each task, increasing it whenever a
child is killed instead of the responsible parent. As soon as the value
rises above, let's say, 5 within a reasonably short time, then apparently
killing children does not help and the mastermind has to go.
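
In pseudocode, with the field and threshold as above (and assuming the
counter is reset after some reasonably short interval):

if (victim->oom_recently_killed_children > 5)
    kill(victim); // killing children did not help, the mastermind goes
else
{
    kill(child_selected_for_sacrifice(victim));
    victim->oom_recently_killed_children++;
}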

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-11 10:16:13

by Lubos Lunak

[permalink] [raw]
Subject: Re: Improving OOM killer

On Wednesday 10 of February 2010, David Rientjes wrote:
> On Wed, 10 Feb 2010, Lubos Lunak wrote:
> > Which is why I suggested summing up the memory of the parent and its
> > children.
>
> That's almost identical to the current heuristic where we sum half of
> each child's VM size; unfortunately it's not a good indicator of
> forkbombs since in your particular example it would be detrimental to
> kdeinit.

I believe that with the algorithm no longer using VmSize and being careful
not to count shared memory more than once this would not be an issue and
kdeinit would be reasonably safe. KDE does not use _that_ much memory to
score higher than something that caused OOM :).

> My heuristic considers runtime of the children as an indicator
> of a forkbombing parent since such tasks don't typically get to run
> anyway. The rss or swap usage of a child with a separate address space
> simply isn't relevant to the badness score of the parent; it unfairly
> penalizes medium/large server jobs.

Our definitions of 'forkbomb' then perhaps differ a bit. I
consider 'make -j100' a kind of forkbomb too; it will very likely overload
the machine as soon as the gcc instances use up all the memory. For that
reason, using CPU time <1 second will also not work here, while using real time
<1 minute would.

That long timeout would have the weakness that, when running a reasonable
'make -j4' at the same time as a Firefox that immediately went crazy, the
make job could be targeted instead if its total cost went higher. However,
here I again believe that the fixed metrics for computing memory usage
would work well enough to let that happen only when the total cost of the
make job was actually higher than that of the offender, and in that case it
is kind of an offender too.

Your protection seems to cover only "for(;;) if(fork() == 0) break;", while
I believe mine could handle also "make -j100" or the bash forkbomb ":()
{ :|:& };:" (i.e. "for(;;) fork();").

> > > We can't address recursive forkbombing in the oom killer with any
> > > efficiency, but luckily those cases aren't very common.
> >
> > Right, I've never run a recursive make that brought my machine to its
> > knees. Oh, wait.
>
> That's completely outside the scope of the oom killer, though: it is _not_
> the oom killer's responsibility to enforce a kernel-wide forkbomb
> policy

Why? It repeatedly causes OOM here (and in fact it is the only common OOM or
forkbomb I ever encounter). If OOM killer is the right place to protect
against a forkbomb that spawns a large number of 1st level children, then I
don't see how this is different.

> > And why exactly is iterating over 1st level children efficient enough
> > and doing that recursively is not? I don't find it significantly more
> > expensive and badness() is hardly a bottleneck anyway.
>
> If we look at children's memory usage recursively, then we'll always end
> up selecting init_task.

Not if the algorithm does not propagate the top of the problematic subtree
higher, see my reply to Alan Cox.

> > Why exactly do you think only 1st generation children matter? Look again
> > at the process tree I posted and you'll see it solves nothing there.
> > I still fail to see why counting also all other generations should be
> > considered anything more than a negligible penalty for something that's
> > not a bottleneck at all.
>
> You're specifying a problem that is outside the scope of the oom killer,
> sorry.

But it could be inside the scope, since it causes OOM, and I don't think
it's an unrealistic or rare use case. I don't consider it less likely than
spawning a large number of direct children. If you want to cover only
certified reasons for causing OOM, it might as well be said that userspace is
not allowed to cause OOM at all.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
[email protected] , [email protected]

2010-02-11 21:17:38

by David Rientjes

[permalink] [raw]
Subject: Re: Improving OOM killer

On Thu, 11 Feb 2010, Lubos Lunak wrote:

> I believe that with the algorithm no longer using VmSize and being careful
> not to count shared memory more than once this would not be an issue and
> kdeinit would be reasonably safe. KDE does not use _that_ much memory to
> score higher than something that caused OOM :).
>

Your suggestion of summing up the memory of the parent and its children
would clearly bias the heuristic against kdeinit if it forks most of kde's
threads, as you mentioned earlier in the thread. Imagine it, or another
server application that Rik mentioned, if all children are first
generation: then it would always be selected if it is the only task
operating on the system. For a web server, for instance, where each query
is handled by a separate thread, we'd obviously prefer to kill a child
thread instead of making the entire server unresponsive. Putting that type
of algorithm in the oom killer and killing the parent instead is just a
non-starter.

> Our definitions of 'forkbomb' then perhaps differ a bit. I
> consider 'make -j100' a kind of forkbomb too; it will very likely overload
> the machine as soon as the gcc instances use up all the memory. For that
> reason, using CPU time <1 second will also not work here, while using real time
> <1 minute would.
>

1 minute? Unless you've got one of SGI's 4K cpu machines where these 1000
threads would actually get any runtime _at_all_ in such circumstances,
that threshold is unreasonable.

A valid point that wasn't raised is that although we can't always detect
out-of-control forking applications, we certainly should do some due diligence in
making sure other applications aren't unfairly penalized when you do
make -j100, for example. That's not the job of the forkbomb detector in
my heuristic, however, it's the job of the baseline itself. In such
scenarios (and when we can't allocate or free any memory), the baseline is
responsible for identifying these tasks and killing them itself because
they are using an excessive amount of memory.

> Your protection seems to cover only "for(;;) if(fork() == 0) break;", while
> I believe mine could handle also "make -j100" or the bash forkbomb ":()
> { :|:& };:" (i.e. "for(;;) fork();").
>

Again, it's not protection against forkbombs: the oom killer is not the
place where you want to enforce any policy that prohibits that.

> Why? It repeatedly causes OOM here (and in fact it is the only common OOM or
> forkbomb I ever encounter). If OOM killer is the right place to protect
> against a forkbomb that spawns a large number of 1st level children, then I
> don't see how this is different.
>

We're not protecting against a large number of first-generation children,
we're simply penalizing them because the oom killer chooses to kill a
large memory-hogging task instead of the parent first. This shouldn't be
described as "forkbomb detection" because thats outside the scope of the
oom killer or VM, for that matter.