2009-01-12 15:33:27

by Evgeniy Polyakov

Subject: Linux killed Kenny, bastard!

Hi.

Do you want to own a tame killer? Do you want to control the world?

Start with your computer now and own the planet next: you already have
an OOM-killer in Linux to kill for you. But to date it has been quite
berserk and usually killed something other than what you wanted it to
murder.

Now you can name the victims: the oom-killer will first select the
process to kill among the ones which have the given string in their
executable name.

By default the process to be killed is called 'Kenny', and if you like
him, change the name by calling

echo Java > /proc/sys/vm/oom_victim

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3d56fe7..26d4361 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -68,6 +68,7 @@ extern int print_fatal_signals;
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern int sysctl_panic_on_oom;
+extern char oom_victim_name[];
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_oom_dump_tasks;
 extern int max_threads;
@@ -1185,6 +1186,15 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "oom_victim",
+		.data		= oom_victim_name,
+		.maxlen		= TASK_COMM_LEN,
+		.mode		= 0644,
+		.proc_handler	= &proc_dostring,
+		.strategy	= &sysctl_string,
+	},
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a0a0190..12419f5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -28,6 +28,8 @@
 #include <linux/memcontrol.h>
 #include <linux/security.h>

+char oom_victim_name[TASK_COMM_LEN] = "Kenny";
+
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
@@ -205,8 +207,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
+	char *name = oom_victim_name;
 	*ppoints = 0;

+again:
 	do_posix_clock_monotonic_gettime(&uptime);
 	do_each_thread(g, p) {
 		unsigned long points;
@@ -223,6 +227,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;

+		if (name && !strstr(p->comm, name))
+			continue;
+
 		/*
 		 * This task already has access to memory reserves and is
 		 * being killed. Don't allow any other task access to the
@@ -263,6 +270,15 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		}
 	} while_each_thread(g, p);

+	/*
+	 * We did not find the process with requested string in its name,
+	 * so lets search for the usual victim.
+	 */
+	if (name && !chosen) {
+		name = NULL;
+		goto again;
+	}
+
 	return chosen;
 }
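
For readers who want to try the selection logic outside the kernel, here is a minimal user-space sketch of the control flow the patch adds. It is not kernel code: `struct fake_task`, `select_victim()` and the `passes` counter are hypothetical names invented for illustration, and the badness score is assumed to be precomputed. It shows the substring filter, the fallback to the usual heuristic, and the fact that a miss costs a second full scan.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the few task_struct fields the patch uses. */
struct fake_task {
	const char *comm;	/* executable name, as in p->comm */
	unsigned long points;	/* precomputed badness score (assumption) */
};

/*
 * Same control flow as the patched select_bad_process(): first consider
 * only tasks whose name contains victim_name, then fall back to all
 * tasks. *passes reports how many full scans of the list were needed.
 */
const struct fake_task *select_victim(const struct fake_task *tasks,
				      size_t n, const char *victim_name,
				      int *passes)
{
	const struct fake_task *chosen = NULL;
	const char *name = victim_name;
	unsigned long best = 0;
	size_t i;

	*passes = 0;
again:
	(*passes)++;
	for (i = 0; i < n; i++) {
		/* Substring match, like strstr(p->comm, name) in the patch. */
		if (name && !strstr(tasks[i].comm, name))
			continue;
		if (!chosen || tasks[i].points > best) {
			best = tasks[i].points;
			chosen = &tasks[i];
		}
	}

	/* Nobody matched the name: rescan with the usual heuristic. */
	if (name && !chosen) {
		name = NULL;
		goto again;
	}
	return chosen;
}
```

With an empty name every task matches (strstr() finds the empty string everywhere), so the first pass already behaves like the unpatched heuristic.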



--
Evgeniy Polyakov


2009-01-12 15:45:25

by Dave Jones

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 06:33:05PM +0300, Evgeniy Polyakov wrote:
> Hi.
>
> Do you want to own a tame killer? Do you want to control the world?
>
> Start with your computer now and own the planet next: you already have
> an OOM-killer in the Linux to kill for you. But to date it was quite
> berserk and usually killed not what you would like him to murder.
>
> Now you can add a name of the victims, which will be checked by the
> oom-killer, who select the process to kill first among the ones which
> have given string in their executable name.
>
> By default the process to be killed is called 'Kenny', and if you like
> him, change then name by calling

I realise it ruins the joke, and it sounds unlikely, but anyone who
happens to have a process called 'Kenny' might be unpleasantly surprised
by this.

If we merge this feature, I think it should default to just using the
existing heuristic.

Dave

--
http://www.codemonkey.org.uk

2009-01-12 15:48:40

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

Hi Dave.

On Mon, Jan 12, 2009 at 10:44:56AM -0500, Dave Jones ([email protected]) wrote:
> > Do you want to own a tame killer? Do you want to control the world?
> >
> > Start with your computer now and own the planet next: you already have
> > an OOM-killer in the Linux to kill for you. But to date it was quite
> > berserk and usually killed not what you would like him to murder.
> >
> > Now you can add a name of the victims, which will be checked by the
> > oom-killer, who select the process to kill first among the ones which
> > have given string in their executable name.
> >
> > By default the process to be killed is called 'Kenny', and if you like
> > him, change then name by calling
>
> I realise it ruins the joke, and it sounds unlikely, but anyone who
> happens to have a process called 'Kenny' might be unpleasantly surprised
> by this.
>
> If we merge this feature, I think it should default to just using the
> existing heuristic.

Well, Kenny has to die, but if we still decide to change the world, here
is the first step.

--- ./mm/oom_kill.c~ 2009-01-12 17:51:23.000000000 +0300
+++ ./mm/oom_kill.c 2009-01-12 18:48:04.000000000 +0300
@@ -28,7 +28,7 @@
 #include <linux/memcontrol.h>
 #include <linux/security.h>

-char oom_victim_name[TASK_COMM_LEN] = "Kenny";
+char oom_victim_name[TASK_COMM_LEN] = "";

 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;


--
Evgeniy Polyakov

2009-01-12 15:49:54

by Alan

Subject: Re: Linux killed Kenny, bastard!

On Mon, 12 Jan 2009 18:33:05 +0300
Evgeniy Polyakov <[email protected]> wrote:

> Hi.
>
> Do you want to own a tame killer? Do you want to control the world?

We've got /proc/*/oom_adj already

2009-01-12 15:50:41

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 03:49:22PM +0000, Alan Cox ([email protected]) wrote:
> > Do you want to own a tame killer? Do you want to control the world?
>
> We've got /proc/*/oom_adj already

Which has to be checked for every process ever created,
which is quite unfeasible in some conditions.

--
Evgeniy Polyakov

2009-01-12 15:51:42

by Alan

Subject: Re: Linux killed Kenny, bastard!

> Well, Kenny has to die, but if we still decide to change the world, here
> is the first step.

NAK this entire thing - we have an existing interface that does the job
far better.

2009-01-12 15:52:24

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 03:51:08PM +0000, Alan Cox ([email protected]) wrote:
> > Well, Kenny has to die, but if we still decide to change the world, here
> > is the first step.
>
> NAK this entire thing - we have an existing interface that does the job
> far better.

Modulo the fact that it does not work for the quickly created processes
which do not have their oom scores adjusted before the oom.

--
Evgeniy Polyakov

2009-01-12 15:53:08

by Alan

Subject: Re: Linux killed Kenny, bastard!

On Mon, 12 Jan 2009 18:50:30 +0300
Evgeniy Polyakov <[email protected]> wrote:

> On Mon, Jan 12, 2009 at 03:49:22PM +0000, Alan Cox ([email protected]) wrote:
> > > Do you want to own a tame killer? Do you want to control the world?
> >
> > We've got /proc/*/oom_adj already
>
> Which has to be checked for every process ever created,
> which is quite unfeasible in some conditions.

The task name is not a reliable indicator of the true name and is
truncated, so it is useless. You only nominate one task, and you don't
integrate with the existing interface.

What you actually need is notifiers to work on /proc (exactly the same as
we need to avoid the bogus waitfd crap). At that point you can implement
arbitrary policy by using dnotify/inotify/etc on /proc.

Alan
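
To make the truncation objection concrete, here is a small user-space sketch. It only approximates how the kernel fills in comm at exec time (last path component, cut to TASK_COMM_LEN - 1 characters); `comm_from_path()` and `comm_collides()` are hypothetical helpers, not kernel functions, and the paths in the usage note are made up. A process can also rename itself with prctl(PR_SET_NAME), which this sketch ignores.

```c
#include <stddef.h>
#include <string.h>

#define TASK_COMM_LEN 16	/* kernel value: 15 characters plus NUL */

/*
 * Approximation of how comm is derived at exec time: take the last
 * path component and keep at most TASK_COMM_LEN - 1 characters.
 */
void comm_from_path(const char *path, char comm[TASK_COMM_LEN])
{
	const char *base = strrchr(path, '/');
	size_t i;

	base = base ? base + 1 : path;
	for (i = 0; i < TASK_COMM_LEN - 1 && base[i]; i++)
		comm[i] = base[i];
	comm[i] = '\0';
}

/* Return 1 if two different executables end up with the same comm. */
int comm_collides(const char *path_a, const char *path_b)
{
	char a[TASK_COMM_LEN], b[TASK_COMM_LEN];

	comm_from_path(path_a, a);
	comm_from_path(path_b, b);
	return strcmp(a, b) == 0;
}
```

Under these assumptions, /usr/bin/Kenny and /home/eve/Kenny are indistinguishable, and so are two long names that differ only after the 15th character.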

2009-01-12 15:56:25

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 03:52:39PM +0000, Alan Cox ([email protected]) wrote:
> > Which has to be checked for every process ever created,
> > which is quite unfeasible in some conditions.
>
> The task name is not a reliable indicator of true name and truncated so
> is useless. You only nominate one task, you don't integrate with the
> existing interface.

Not one, but tasks which have the given string in the name. Like script
names spawned at DoS time.

> What you actually need is notifiers to work on /proc (exactly the same as
> we need to avoid the bogus waitfd crap). At that point you can implement
> arbitrary policy by using dnotify/inotify/etc on /proc.

Yes, it could be done, provided the inotify daemon is not killed itself,
inotify is enabled in the config and the daemon has been started.
But right now there is no way to solve that task; in the long term this
is a good idea to implement, modulo the security problems it may raise.

--
Evgeniy Polyakov

2009-01-12 16:19:30

by Alan

Subject: Re: Linux killed Kenny, bastard!

> Yes, it could be done. If inotify will not be killed itself, will be
> enabled in the config and daemon will be started.
> But right now there is no way to solve that task, in the long term this
> is a good idea to implement modulo security problems it may concern.

It is perfectly soluble right now, use the existing /proc interface. If
you want to specifically victimise new tasks first then set everything
else with an adjust *against* being killed and new stuff will start off
as cannon fodder until classified.

The name approach is the wrong way to handle this. It has no reflection
of hierarchy of process, targeting by users, containers etc.

In fact containers are probably the right way to do it
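
As a rough illustration of the existing interface referred to here, the sketch below writes an adjustment to a task's oom_adj file from user space. The helper names `oom_adj_path()` and `set_oom_adj()` are hypothetical, and the `procroot` parameter exists only so the path can be redirected for testing; it would normally be "/proc". The value range (-16 to 15, with -17 disabling OOM killing for the task) is the one documented for oom_adj in kernels of this era.

```c
#include <stdio.h>
#include <sys/types.h>

#define OOM_DISABLE	(-17)	/* oom_adj value that exempts a task */
#define OOM_ADJUST_MIN	(-16)
#define OOM_ADJUST_MAX	15

/* Build "<procroot>/<pid>/oom_adj"; returns 0, or -1 if buf is too small. */
int oom_adj_path(char *buf, size_t len, const char *procroot, pid_t pid)
{
	int n = snprintf(buf, len, "%s/%ld/oom_adj", procroot, (long)pid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write adj for pid; returns 0 on success, -1 on any error. */
int set_oom_adj(const char *procroot, pid_t pid, int adj)
{
	char path[64];
	FILE *f;
	int ok;

	/* Reject values the kernel would refuse anyway. */
	if (adj != OOM_DISABLE && (adj < OOM_ADJUST_MIN || adj > OOM_ADJUST_MAX))
		return -1;
	if (oom_adj_path(path, sizeof(path), procroot, pid) < 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", adj) > 0;
	return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

A policy daemon following the suggestion above might call set_oom_adj("/proc", pid, -5) for every task it has classified as important, leaving unclassified tasks at their default score. Lowering the value typically requires privileges.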

2009-01-12 16:22:39

by Dave Jones

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 06:56:15PM +0300, Evgeniy Polyakov wrote:
> On Mon, Jan 12, 2009 at 03:52:39PM +0000, Alan Cox ([email protected]) wrote:
> > > Which has to be checked for every process ever created,
> > > which is quite unfeasible in some conditions.
> >
> > The task name is not a reliable indicator of true name and truncated so
> > is useless. You only nominate one task, you don't integrate with the
> > existing interface.
>
> Not one, but tasks which have the given string in the name. Like script
> names spawned at DoS time.

There is also the problem that process names aren't unique.
If the process table contains two entries called 'Kenny', there's nothing
that says they came from the same executable.

Dave

--
http://www.codemonkey.org.uk

2009-01-12 16:28:32

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 11:22:09AM -0500, Dave Jones ([email protected]) wrote:
> > Not one, but tasks which have the given string in the name. Like script
> > names spawned at DoS time.
>
> There is also the problem that process names aren't unique.
> If the process table contains two entries called 'Kenny', there's nothing
> that says they came from the same executable.

Agreed; the oom-killer will compute their points and, if they are really
different applications, the 'bad' one will be killed.

--
Evgeniy Polyakov

2009-01-12 16:29:49

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 04:19:31PM +0000, Alan Cox ([email protected]) wrote:
> > Yes, it could be done. If inotify will not be killed itself, will be
> > enabled in the config and daemon will be started.
> > But right now there is no way to solve that task, in the long term this
> > is a good idea to implement modulo security problems it may concern.
>
> It is perfectly soluble right now, use the existing /proc interface. If
> you want to specifically victimise new tasks first then set everything
> else with an adjust *against* being killed and new stuff will start off
> as cannon fodder until classified.
>
> The name approach is the wrong way to handle this. It has no reflection
> of hierarchy of process, targeting by users, containers etc.
>
> In fact containers are probably the right way to do it

Containers to solve oom-killer selection problem? :)

More seriously, I agree that a simple name does not solve the problem
from every angle, but that is not the main goal.
The patch solves the oom-killer selection issue for what is likely the
most common case: when you know who should be checked and killed first
when the problem appears.

--
Evgeniy Polyakov

2009-01-12 21:30:26

by Chris Snook

Subject: Re: Linux killed Kenny, bastard!

Evgeniy Polyakov wrote:
> On Mon, Jan 12, 2009 at 03:51:08PM +0000, Alan Cox ([email protected]) wrote:
>>> Well, Kenny has to die, but if we still decide to change the world, here
>>> is the first step.
>> NAK this entire thing - we have an existing interface that does the job
>> far better.
>
> Modulo the fact that it does not work for the quickly created processes
> which do not have their oom scores adjusted before the oom.
>

cgroups solve this problem much more cleanly.

-- Chris

2009-01-12 21:42:40

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 04:29:10PM -0500, Chris Snook ([email protected]) wrote:
> >Modulo the fact that it does not work for the quickly created processes
> >which do not have their oom scores adjusted before the oom.
>
> cgroups solve this problem much more cleanly.

When they are configured and enabled :)
And actually not, since having two separate groups may still result in
the wrong oom-killing: the same group should contain all potentially
'bad' processes, so that it is scanned first instead of doing the whole
scan.

Having a name to kill is far simpler than anything else, and while
this may not be the finest-grained solution, it is the most obvious and
the simplest to work with.

I do agree that there are other ways to solve the same problem, and
likely they provide better control, but their setup/control cost is not
comparable with a simple name-based scheme that selects 'victim'
processes by their scores.

Effectively it is similar to the oom_kill_allocating_task trick, which
can also be solved by adjusting the oom score of every other process in
the system, or by putting it into a separate group, or anything else.
But still it is much simpler to have a single flag which solves the
problem, maybe not optimally, but close to it in most cases.

My patch does the same: it allows selecting a set of processes by the
given string in the executable name, and then picks a victim among them
based on the existing scores. This is the simplest and thus it could be
the most useful case.

--
Evgeniy Polyakov

2009-01-12 23:01:42

by Bill Davidsen

Subject: Re: Linux killed Kenny, bastard!

Evgeniy Polyakov wrote:
> On Mon, Jan 12, 2009 at 04:19:31PM +0000, Alan Cox ([email protected]) wrote:
>>> Yes, it could be done. If inotify will not be killed itself, will be
>>> enabled in the config and daemon will be started.
>>> But right now there is no way to solve that task, in the long term this
>>> is a good idea to implement modulo security problems it may concern.
>> It is perfectly soluble right now, use the existing /proc interface. If
>> you want to specifically victimise new tasks first then set everything
>> else with an adjust *against* being killed and new stuff will start off
>> as cannon fodder until classified.
>>
>> The name approach is the wrong way to handle this. It has no reflection
>> of hierarchy of process, targeting by users, containers etc.
>>
>> In fact containers are probably the right way to do it
>
> Containers to solve oom-killer selection problem? :)
>
> Being more serious, I agree that having a simple name does not solve the
> problem if observed from any angle, but it is not the main goal.
> Patch solves oom-killer selection issue from likely the most commonly
> used case: when you know who should be checked and killed first when
> problem appears.
>
The only cases in which this would really be useful are when running some
software which once in a great while goes super prompt critical and starts
throwing processes of a known name format in all directions, or when you have a
problem and know the process names involved before OOM kills everything in sight.

This does have a strange attraction, I did save the patch in case another "every
few years" problem comes up.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2009-01-12 23:17:39

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 06:00:10PM -0500, Bill Davidsen ([email protected]) wrote:
> >Being more serious, I agree that having a simple name does not solve the
> >problem if observed from any angle, but it is not the main goal.
> >Patch solves oom-killer selection issue from likely the most commonly
> >used case: when you know who should be checked and killed first when
> >problem appears.
> >
> The only cases in which this would really be useful is when running some
> software which once in a great while goes super prompt critical and starts
> throwing processes of a known name format in all directions, or when you
> have a problem and know the process names involved before OOM kills
> everything in sight.

Like anything that spawns a thread or process per request/client, or
preallocates a set of them which connect to a huge object like a database.
Most of the time the database/server is killed first instead of the
comparably small clients. In some cases it is possible to tune the
environment, in others it is not that simple. This patch works for such
situations perfectly and does not add administrative burden, since it
does not make things worse as a whole, but only better for the very
commonly used cases; that's why I propose it for inclusion.

--
Evgeniy Polyakov

2009-01-13 02:08:38

by David Rientjes

Subject: Re: Linux killed Kenny, bastard!

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Like anything that spawns a thread or process per request/client, or
> preallocates set of them which connect to the huge object like database.
> Most of the time database/server is killed first instead of comparably
> small clients.

No, the reverse is true: when a task is chosen for oom kill based on the
badness heuristic, the oom killer first attempts to kill any child task
that isn't attached to the same mm. If the child shares an mm, both tasks
must die before memory freeing can occur.
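
The child-first behaviour described in this paragraph can be sketched in user space as follows. This is a simplified illustration, not the kernel's code: `struct mini_task`, its `mm_id` field and `pick_kill_target()` are hypothetical, and the real kernel only moves on to the next candidate if killing a child fails, which the sketch does not model.

```c
#include <stddef.h>

/* Hypothetical miniature of a task: an mm identity and a child list. */
struct mini_task {
	int mm_id;			/* tasks sharing an mm share this id */
	struct mini_task **children;	/* NULL-terminated array, may be NULL */
};

/*
 * Given the task that won the badness contest, return the task that
 * would actually be killed: the first child with its own mm if there
 * is one, otherwise the chosen task itself.
 */
struct mini_task *pick_kill_target(struct mini_task *chosen)
{
	struct mini_task **c;

	for (c = chosen->children; c && *c; c++) {
		/* Same mm: killing this child alone frees no memory. */
		if ((*c)->mm_id == chosen->mm_id)
			continue;
		return *c;
	}
	return chosen;
}
```

Under this model, a server that forks worker processes loses a worker first, while a server whose children are all threads sharing its mm is killed itself.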

> In some cases it is possible to tune the environment, in
> others it is not that simple. This patch works for such situations
> perfectly and does not require additional administrative burden, since
> it does not make things worse as a whole, but only better for the very
> commonly used cases, that's why I propose it for inclusion.
>

It's an inappropriate addition since /proc/pid/oom_adj scores exist which
can prefer or protect certain tasks over others when the oom killer
chooses a target, including oom kill immunity. These scores are inherited
from parent tasks and can be tuned after the fork to your oom kill target
preference.
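
As a sketch of how an oom_adj value prefers or protects a task, the function below scales a raw badness score the way mm/oom_kill.c did in kernels of this period, as far as I recall it: a positive value shifts the score left, a negative one shifts it right. Treat the exact arithmetic as an assumption, and note that `apply_oom_adj()` is a hypothetical user-space helper. The kernel skips OOM_DISABLE tasks entirely rather than scoring them as zero, which is approximated here.

```c
#define OOM_DISABLE (-17)	/* oom_adj value meaning "never kill" */

/*
 * Scale a raw badness score by oom_adj. Each positive step doubles the
 * score and each negative step halves it.
 */
unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
	if (oom_adj == OOM_DISABLE)
		return 0;	/* approximation: the task is skipped outright */
	if (oom_adj > 0) {
		if (!points)
			points = 1;	/* make sure the shift has an effect */
		return points << oom_adj;
	}
	if (oom_adj < 0)
		return points >> -oom_adj;
	return points;
}
```

Because the value is inherited across fork, a launcher that sets a positive oom_adj on itself before spawning workers makes every worker a preferred target without touching them individually.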

2009-01-13 08:52:56

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 05:53:47PM -0800, David Rientjes ([email protected]) wrote:
> On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:
>
> > Like anything that spawns a thread or process per request/client, or
> > preallocates set of them which connect to the huge object like database.
> > Most of the time database/server is killed first instead of comparably
> > small clients.
>
> No, the reverse is true: when a task is chosen for oom kill based on the
> badness heuristic, the oom killer first attempts to kill any child task
> that isn't attached to the same mm. If the child shares an mm, both tasks
> must die before memory freeing can occur.

That is theory, not practice. On the tested machines the OOM-killer
most of the time starts with ssh, the database and lighttpd, when it
could start in the reverse order and not touch ssh at all. Better still,
it would start not with the daemon itself, but with its spawned fastcgi
processes.

> > In some cases it is possible to tune the environment, in
> > others it is not that simple. This patch works for such situations
> > perfectly and does not require additional administrative burden, since
> > it does not make things worse as a whole, but only better for the very
> > commonly used cases, that's why I propose it for inclusion.
> >
>
> It's an inappropriate addition since /proc/pid/oom_adj scores exist which
> can prefer or protect certain tasks over others when the oom killer
> chooses a target, including oom kill immunity. These scores are inherited
> from parent tasks and can be tuned after the fork to your oom kill target
> preference.

I agree that there are ways to tune how the oom-killer selects the
victim, and likely after hours of games this subtlety will work for the
specified workload. What I propose is the simplest way for the most
commonly used case. It is a help for the admin, not something that
forces him to invent complex machinery which will be error-prone and
hard to debug when oom eventually happens. That will work, but it is way
more complex than what I propose, without immediately visible net
effects on other parts of the originally balanced system.

--
Evgeniy Polyakov

2009-01-13 09:54:34

by David Rientjes

Subject: Re: Linux killed Kenny, bastard!

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> It is a theory, not a practice. OOM-killer most of time starts from ssh,
> database and lighttpd on the tested machines, when it could start in
> the reverse order and do not touch ssh at all. Better not from daemon
> itself, but its fastcgi spawned processes.
>

In the unconstrained system-wide oom case, it scans each task on the
system (which can take very long, ask SGI) and rates its badness scoring.
When a memory-hogging task is identified, which you have complete control
over in userspace by tuning /proc/pid/oom_adj, it attempts to kill a child
first if it will allow for memory freeing without killing the parent.

> I agree, that there are ways to tune the way oom-killer selects the
> victim, and likely after hours of games this subtly will work for the
> specified workload.

It doesn't involve "hours of games," it is a very simple heuristic that
you can easily tune to specify your preferences.

What you're looking for with your patch is simply a way to specify an oom
preference before the task has been forked, but that's simple to do with
the current logic since oom_adj scores are inherited and preference is
given to killing a child before parent.

> What I propose is the simplest way for the most
> commonly used case.

No, procfs is the correct interface for tuning oom kill preferences and
not by name parsing.

With oom_adj scores, you have the ability to specify oom kill preferences
within a cpuset or memory controller as well, whereas oom_victim_name is
global and very costly when not found in select_bad_process().

> It is a help for the admin and not the force to
> invent complex machinery which will be error-prone and hard to debug
> when eventually oom happens.

It's very simple to debug the oom killer's decisions, which is why I
introduced /proc/sys/vm/oom_dump_tasks.

It also requires two expensive scans of the entire tasklist (I introduced
/proc/sys/vm/oom_kill_allocating_task specifically to avoid _one_
expensive scan) when oom_victim_name isn't found.

2009-01-13 11:09:47

by Tomasz Chmielewski

Subject: Re: Linux killed Kenny, bastard!

>> On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:
>>
>> > Like anything that spawns a thread or process per request/client, or
>> > preallocates set of them which connect to the huge object like database.
>> > Most of the time database/server is killed first instead of comparably
>> > small clients.
>>
>> No, the reverse is true: when a task is chosen for oom kill based on the
>> badness heuristic, the oom killer first attempts to kill any child task
>> that isn't attached to the same mm. If the child shares an mm, both tasks
>> must die before memory freeing can occur.
>
> It is a theory, not a practice. OOM-killer most of time starts from ssh,
> database and lighttpd on the tested machines, when it could start in
> the reverse order and do not touch ssh at all. Better not from daemon
> itself, but its fastcgi spawned processes.

How does this feature relate to:

config ANDROID_LOW_MEMORY_KILLER
bool "Android Low Memory Killer"
default N
---help---
Register processes to be killed when memory is low

available in Staging drivers / Android?


--
Tomasz Chmielewski
http://wpkg.org

2009-01-13 11:54:20

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 01:54:02AM -0800, David Rientjes ([email protected]) wrote:
> > It is a theory, not a practice. OOM-killer most of time starts from ssh,
> > database and lighttpd on the tested machines, when it could start in
> > the reverse order and do not touch ssh at all. Better not from daemon
> > itself, but its fastcgi spawned processes.
>
> In the unconstrained system-wide oom case, it scans each task on the
> system (which can take very long, ask SGI) and rates its badness scoring.
> When a memory-hogging task is identified, which you have complete control
> over in userspace by tuning /proc/pid/oom_adj, it attempts to kill a child
> first if it will allow for memory freeing without killing the parent.

Is this supposed to explain why ssh is killed?

> > I agree, that there are ways to tune the way oom-killer selects the
> > victim, and likely after hours of games this subtly will work for the
> > specified workload.
>
> It doesn't involve "hours of games," it is a very simple heuristic that
> you can easily tune to specify your preferences.
>
> What you're looking for with your patch is simply a way to specify an oom
> preference before the task has been forked, but that's simple to do with
> the current logic since oom_adj scores are inherited and preference is
> given to killing a child before parent.

It is a very subtle approach. Consider the case when you have a pool of
threads/processes which are created and released on demand; there are
several such pools for different servers, and you do know which one
is very likely to be guilty.

Who should adjust the scores for newly created processes? Who should
check that processes in the first group have a negative oom adjustment
and those in the second group a positive value? Who determines when it's
time to adjust the scores?

> > What I propose is the simplest way for the most
> > commonly used case.
>
> No, procfs is the correct interface for tuning oom kill preferences and
> not by name parsing.
>
> With oom_adj scores, you have the ability to specify oom kill preferences
> within a cpuset or memory controller as well, whereas oom_victim_name is
> global and very costly when not found in select_bad_process().
>
> > It is a help for the admin and not the force to
> > invent complex machinery which will be error-prone and hard to debug
> > when eventually oom happens.
>
> It's very simple to debug the oom killer's decisions, which is why I
> introduced /proc/sys/vm/oom_dump_tasks.
>
> It also requires two expensive scans of the entire tasklist (I introduced
> /proc/sys/vm/oom_kill_allocating_task specifically to avoid _one_
> expensive scan) when oom_victim_name isn't found.

It is not really costly, since most of the time we skip an entry and do
not lock the task and do not calculate its badness value. No one worries
that 'ps ax' is costly because it has to run through all the processes.

Messing with the scores is actually more expensive since we have to lock
the task and perform a calculation. I do not say it is wrong, but it is
a much more complex task to get stable compared to simple task selection
by its name. It is what is used and what people expect to have (that's
actually why it was implemented :) rather than writing some daemons to
monitor the clients and appropriate processes or changing the code of
the servers.

--
Evgeniy Polyakov

2009-01-13 12:10:39

by Alan

Subject: ANDROID low memory killer register (was Linux killed Kenny, bastard!)

> config ANDROID_LOW_MEMORY_KILLER
> bool "Android Low Memory Killer"
> default N
> ---help---
> Register processes to be killed when memory is low
>
> available in Staging drivers / Android?

I believe they are related in the sense that both of them are
unnecessary and should not make the final kernel...

The /proc../oom_adj interface is again sufficient for this.

2009-01-13 12:15:44

by Alan

Subject: Re: Linux killed Kenny, bastard!

> Who should adjust the scores for newly created processes? Who should
> check that processes in the first group have negative oom adjustment and
> in the second group a positive value? Who determines when its time to
> adjust the scores?

This is policy. Where does policy go ? User space

2009-01-13 12:20:53

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 11:58:38AM +0100, Tomasz Chmielewski ([email protected]) wrote:
> config ANDROID_LOW_MEMORY_KILLER
> bool "Android Low Memory Killer"
> default N
> ---help---
> Register processes to be killed when memory is low
>
> available in Staging drivers / Android?

It looks very similar to the vanilla selection of the process to be
killed, but it takes only the rss size into account before the oom
parameters.

--
Evgeniy Polyakov

2009-01-13 12:24:18

by Evgeniy Polyakov

Subject: Re: ANDROID low memory killer register (was Linux killed Kenny, bastard!)

On Tue, Jan 13, 2009 at 12:10:38PM +0000, Alan Cox ([email protected]) wrote:
> I believe they are related in the sense that both of them are
> unnecessary and should not make the final kernel...
>
> The /proc../oom_adj interface is again sufficient for this.

Besides the fact that it is unlikely to work the way you expect, since
oom adjustment is racy and may come too late, or not happen at all...

Alan, please try to apply this in practice and then you will suddenly
find that it would be really great to have a different method, which
will just work.

--
Evgeniy Polyakov

2009-01-13 12:27:18

by Alan

Subject: Re: ANDROID low memory killer register (was Linux killed Kenny, bastard!)

On Tue, 13 Jan 2009 15:23:56 +0300
Evgeniy Polyakov <[email protected]> wrote:

> On Tue, Jan 13, 2009 at 12:10:38PM +0000, Alan Cox ([email protected]) wrote:
> > I believe they are related in the sense that both of them are
> > unnecessary and should not make the final kernel...
> >
> > The /proc../oom_adj interface is again sufficient for this.
>
> Besides the fact that it will unlikely to work the way you expect, since
> oom adjustment is racy and may be too late, or not happened at all...

I would love to see a proposal that met such criteria. I don't think I
ever will and your random name based hack isn't it. The closest we have
right now is the container stuff.

Alan

2009-01-13 12:29:30

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 12:15:10PM +0000, Alan Cox ([email protected]) wrote:
> > Who should adjust the scores for newly created processes? Who should
> > check that processes in the first group have negative oom adjustment and
> > in the second group a positive value? Who determines when its time to
> > adjust the scores?
>
> This is policy. Where does policy go ? User space

Did you notice how many 'who' questions were asked, and that there was
only a single 'user space' answer? Because it is not an answer, it is a
theoretical POV, which does not really work in practice, since it is far
too inconvenient and error-prone, and actually it does not work when
needed, since because of its complexity something will be missed. I've
just asked the admins who originally requested the 'kill-by-name'
feature why they did not work with /proc/.../oom_adj, and got a nice
answer: we tried, but likely something went wrong and it did not work
the way we wanted.

There is no way to know that the adjustment is correct, that everything
was up to date when oom happened, that nothing was forgotten, and
practice shows that there are always such problems and the wrong tasks
are killed.

When you set a name you know that it works, since there is only a single
place to update and no need to bother with ugly tools or changes,
especially to handle short-lived processes.

--
Evgeniy Polyakov

2009-01-13 13:19:59

by Theodore Ts'o

Subject: Re: Linux killed Kenny, bastard!

Instead of trying to specify which process should be protected from
the OOM killer by name, how about something which is inherited from
the parent process? After all, if having the child not get killed due
to OOM is important, the child won't even have a chance to run if the
parent gets killed off. And in fact, we have something that fits that
bill fairly well; getrlimit()/setrlimit(). Why not define a new
resource limit which specifies a relative immunity to the oom_killer?

Most of the infrastructure to support that will already be in place
(i.e., shell support, PAM support in /etc/security/limits.conf); all
that would need to be done is to teach a few userspace
programs/libraries about the new resource limit.

This would be a much cleaner approach, I would think.

Regards,

- Ted

2009-01-13 13:36:15

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 08:19:37AM -0500, Theodore Tso ([email protected]) wrote:
> Instead of trying to specify which process should be protected from
> the OOM killer by name, how about something which is inherited from
> the parent process? After all, if having the child not get killed due
> to OOM is important, the child won't even have a chance to run if the
> parent gets killed off. And in fact, we have something that fits that
> bill fairly well; getrlimit()/setrlimit(). Why not define a new
> resource limit which specifies a relative immunity to the oom_killer?
>
> Most of the infrastructure to support that will already be in place
> (i.e., shell support, PAM support in /etc/security/limits.conf); all
> that would need to be done is to teach a few userspace
> programs/libraries about the new resource limit.
>
> This would be a much cleaner approach, I would think.

It would be similar to the oom_adj parameter (although I did not find
where that is inherited from the parent), but with a different updating
interface. I do not think it would make the problem any easier to solve,
since the problem is not directly in the parent/child hierarchy: there
are cases when we do want to kill children (this phrase just screams for
the addition: and eat them), but only some processes which are not
really the most significant.

The existing oom score adjustment mechanism works for these cases, but it
is not convenient to use by itself. Even its documentation does not say
how it is used :) It is not a simple add/remove, but multiplication or
division of the score by two to the power of the oom_adj value. Plus
really no one knows how scores are calculated except those who read
mm/oom_kill.c before going to sleep.
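
As a rough illustration of the scaling described above, here is a minimal
userspace sketch. It assumes the adjustment is applied as a left shift for
positive oom_adj values and a right shift for negative ones, which is how
mm/oom_kill.c of this era appears to do it. The function name
apply_oom_adj() is hypothetical, and the special value -17 (disable) is
assumed to be handled separately by the caller.

```c
#include <assert.h>

/*
 * Hypothetical model of the oom_adj step; this is not kernel code.
 * points: badness score before adjustment
 * adj:    oom_adj value in the range -16..+15
 */
static unsigned long apply_oom_adj(unsigned long points, int adj)
{
	assert(adj >= -16 && adj <= 15);

	if (adj > 0) {
		/* A zero score would stay zero after shifting, so bump it. */
		if (!points)
			points = 1;
		points <<= adj;
	} else if (adj < 0) {
		points >>= -adj;
	}
	return points;
}
```

With a score of 1000, an oom_adj of 3 gives 8000 and an oom_adj of -3
gives 125, so each step doubles or halves the score rather than adding a
fixed offset.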

So effectively oom_adj only works as an enable/disable switch, and since
no one knows how to tune it, it is better not to touch it at all. And get
ssh killed. I believe that if it is ever used, then only to disable oom
killing altogether, which is wrong, since the task still may be killed,
but after some others. My patch adds a simple priority for that based on
the name of the process, which is known to the administrators who
maintain the given system.

--
Evgeniy Polyakov

2009-01-13 13:48:34

by Alan

Subject: Re: Linux killed Kenny, bastard!

> (i.e., shell support, PAM support in /etc/security/limits.conf); all
> that would need to be done is to teach a few userspace
> programs/libraries about the new resource limit.

You don't even need that - just define the behaviour of oom_adj to
inherit.

Of course that's still often totally the wrong behaviour, as you'll find out
when a key system service is started up by a client that wants to use it
that was run by some low privilege untrusted process with tight resource
limits.

For the desktop, Jim Gettys proposed a very simple and quite elegant use of
this sort of thing, which was to let the window manager do some of the
work according to what hadn't been used for ages.

2009-01-13 13:53:34

by Jan-Frode Myklebust

Subject: Re: Linux killed Kenny, bastard!

On 2009-01-13, David Rientjes <[email protected]> wrote:
>
> When a memory-hogging task is identified, which you have complete control
> over in userspace by tuning /proc/pid/oom_adj, it attempts to kill a child
> first if it will allow for memory freeing without killing the parent.

So an alternative to Evgeniy Polyakov's patch would be:

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..5dcfc88 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,10 +2311,19 @@ increase the likelihood of this process being killed by the oom-killer. Valid
values are in the range -16 to +15, plus the special value -17, which disables
oom-killing altogether for this process.

+Child processes inherit the parent's oom_adj value, so to launch a potential
+rogue process that you want to be the primary target of the oom-killer,
+adjust the value of the parent process before launching the potential
+rogue process. For example, to make sure the process "Kenny" will be a
+prime candidate to get killed:
+
+ echo 15 > /proc/self/oom_adj
+ ./Kenny
+ echo -15 > /proc/self/oom_adj
+
2.13 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------

-------------------------------------------------------------------------------
This file can be used to check the current score used by the oom-killer is for
any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
process should be killed in an out-of-memory situation.

2009-01-13 13:53:53

by Evgeniy Polyakov

Subject: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Mon, Jan 12, 2009 at 03:51:08PM +0000, Alan Cox ([email protected]) wrote:
> > Well, Kenny has to die, but if we still decide to change the world, here
> > is the first step.
>
> NAK this entire thing - we have an existing interface that does the job
> far better.

Mwahaha, I just checked how scores are calculated, so that userspace
could adjust them. Let's start from the beginning:

	list_for_each_entry(child, &p->children, sibling) {
		task_lock(child);
		if (child->mm != mm && child->mm)
			points += child->mm->total_vm/2 + 1;
		task_unlock(child);
	}

	/*
	 * CPU time is in tens of seconds and run time is in thousands
	 * of seconds. There is no particular reason for this other than
	 * that it turned out to work very well in practice.
	 */
	cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime))
		>> (SHIFT_HZ + 3);

	if (uptime >= p->start_time.tv_sec)
		run_time = (uptime - p->start_time.tv_sec) >> 10;
	else
		run_time = 0;

	s = int_sqrt(cpu_time);
	if (s)
		points /= s;
	s = int_sqrt(int_sqrt(run_time));
	if (s)
		points /= s;

Do you _REALLY_ think anyone can calculate it themselves and then properly
calculate the adjustment needed to select the oom-killed process?

I can not, and would not even try if I were an admin of the given
system. So, Alan, until you can calculate those numbers in your head and
then do this for a whole heavily loaded system, please do not spread the
idea that oom_adj can be used to tune the oom-killer.
And no, reading data from /proc/.../oom_score is not enough, since the
scores change with time, so the same calculation would have to be redone
every time to tune the adjustment.
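
To make the heuristic concrete, here is a minimal userspace sketch of
the arithmetic in the excerpt. It is only a model under stated
assumptions: the starting score is taken to be the task's own total_vm
(the excerpt does not show that line), cpu_time and run_time are passed
in already scaled, and the names isqrt() and badness_sketch() are
hypothetical stand-ins rather than kernel interfaces.

```c
#include <assert.h>

/* Floor of the square root, standing in for the kernel's int_sqrt(). */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	/* Division form avoids overflow of (r + 1) * (r + 1). */
	while (r + 1 <= x / (r + 1))
		r++;
	return r;
}

/*
 * Hypothetical model of the quoted scoring steps.
 * total_vm:  pages mapped by the task itself (assumed starting score)
 * child_vm:  total_vm of each child that has its own mm
 * cpu_time:  CPU time, already scaled (roughly tens of seconds)
 * run_time:  run time, already scaled (roughly thousands of seconds)
 */
static unsigned long badness_sketch(unsigned long total_vm,
				    const unsigned long *child_vm,
				    int nchildren,
				    unsigned long cpu_time,
				    unsigned long run_time)
{
	unsigned long points = total_vm;
	unsigned long s;
	int i;

	/* Each child with a separate mm adds half of its size plus one. */
	for (i = 0; i < nchildren; i++)
		points += child_vm[i] / 2 + 1;

	/* Long-running, CPU-heavy tasks get their score reduced. */
	s = isqrt(cpu_time);
	if (s)
		points /= s;
	s = isqrt(isqrt(run_time));
	if (s)
		points /= s;

	return points;
}
```

For a task with total_vm of 10000 pages, two children of 2000 and 4000
pages, cpu_time 16 and run_time 81, this gives 13002 / 4 / 3 = 1083.
Since both divisors grow as the task ages, the same task scores
differently later on, which is the moving-target problem raised above.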

So far my patch is the sanest way to deal with the OOM selection, when
we have to differentiate some processes. I agree, it is not the best
solution, but it is way ahead of what we have right now for users who
are not hardcore kernel hackers.

--
Evgeniy Polyakov

2009-01-13 13:59:30

by Alan

Subject: Re: Linux killed Kenny, bastard!

> +rogue process. For example, to make sure the process "Kenny" will be a
> +prime candidate to get killed:
> +
> + echo 15 > /proc/self/oom_adj
> + ./Kenny
> + echo -15 > /proc/self/oom_adj
> +

This is a bogus and silly example - but it does show why the whole thing
is not needed

(echo "15" >/proc/self/oom_adj; exec ./Kenny) &

See it's like nice, you can do it yourself anyway if you are
co-operating, and if you aren't co-operating you need containers anyway...

Alan

2009-01-13 14:07:20

by Alan

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

> Do you _REALLY_ think anyone can calculate it themselves and then properly
> calculate the adjustment needed to select the oom-killed process?

It's always a heuristic.

> So far my patch is the sanest way to deal with the OOM selection

No. You keep maintaining this but your crude hack is useless in a non
co-operative environment, has lots of issues with name aliasing and
doesn't deal with real needs.

We have container interfaces that can do this and far more and do them
right. In fact the very start of all the OpenVZ and container work years
ago was the beancounter patches which were addressed at exactly this
problem (although more specifically 'making sure undergraduates' processes
get killed first')

Alan

2009-01-13 14:24:37

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 02:06:27PM +0000, Alan Cox ([email protected]) wrote:
> > Do you _REALLY_ think anyone can calculate it themselves and then properly
> > calculate the adjustment needed to select the oom-killed process?
>
> It's always a heuristic.

For the system, which knows what it is. The user does not and really can
not work with it, since there is no sane way to implement that heuristic
in applications or even in a (theoretically possible) monitor daemon.

So, effectively, oom adjustment does not work.

> > So far my patch is the sanest way to deal with the OOM selection
>
> No. You keep maintaining this but your crude hack is useless in a non
> co-operative environment, has lots of issues with name aliasing and
> doesn't deal with real needs.

It was created because of real needs. Because people need to control the
behaviour of the system and they want to control which application will
be killed to free the memory. The attached patch is not the best solution,
but it works for all the cases I can think of.

Let's take your 'name aliasing' claim: if there are several processes
with the same name, the system will select the one with the worst score
according to its own magical algorithm. So it will not kill a random
process just because it happened to have a tricky name.

And the same applies to the other issues. It just helps the system select
the process to be killed according to userspace's expectation of what
should be killed to free the memory.

> We have container interfaces that can do this and far more and do them
> right. In fact the very start of all the OpenVZ and container work years
> ago was the beancounter patches which were addressed at exactly this
> problem (although more specifically 'making sure undergraduates' processes
> get killed first')

Are the beancounters used to limit the amount of virtual RAM and not the
physical one? That really does not work to limit, for example, some Java
machine which will eat all the virtual space, swapping out a different
node. It works for some (and likely most, I do not argue this) cases and
has overhead. But we are not talking about how to limit the processes,
but about what to do when we happen to have an out-of-memory condition.
And it happens all the time even if you put the processes into a separate
container, since there are situations (that's why this was started in the
first place) when you have a huge process which should not be killed and
a set of either its children or external processes which should be
checked, and some of them (the administrator would like to specify the
less important ones) should be killed without much harm to the system.

And the patch I presented allows doing that. It introduces a hint for the
killer on which processes should be checked first. It works exactly the
way people work with their systems: they run different applications and
expect some of them to have higher or lower priority when it comes to the
oom condition. No one proposes to kill exactly the process we select
(although that may be a good idea in some cases), but instead to show
that the oom-killer should check a given group first: the group the
administrator knows to be potentially harmless.

--
Evgeniy Polyakov

2009-01-13 15:00:31

by Balbir Singh

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 7:54 PM, Evgeniy Polyakov <[email protected]> wrote:
> On Tue, Jan 13, 2009 at 02:06:27PM +0000, Alan Cox ([email protected]) wrote:
>> > Do you _REALLY_ think anyone can calculate it themselves and then properly
>> > calculate the adjustment needed to select the oom-killed process?
>>
>> It's always a heuristic.
>
> For the system, which knows what it is. The user does not and really can
> not work with it, since there is no sane way to implement that heuristic
> in applications or even in a (theoretically possible) monitor daemon.
>
> So, effectively, oom adjustment does not work.
>
>> > So far my patch is the sanest way to deal with the OOM selection
>>
>> No. You keep maintaining this but your crude hack is useless in a non
>> co-operative environment, has lots of issues with name aliasing and
>> doesn't deal with real needs.
>
> It was created because of real needs. Because people need to control the
> behaviour of the system and they want to control which application will
> be killed to free the memory. The attached patch is not the best solution,
> but it works for all the cases I can think of.
>

Where does this end? Tomorrow you'll add an interface for applications
that should *not* be killed? What sort of a heuristic is name? I think
the only name the kernel knows about is "init".

> Let's take your 'name aliasing' claim: if there are several processes
> with the same name, the system will select the one with the worst score
> according to its own magical algorithm. So it will not kill a random
> process just because it happened to have a tricky name.
>

Having a name in the kernel is like building a hit-list, why can't the
examples that Alan sent work for you?
Names are tricky as well, if someone used a symbolic link to the
application with a different name, they would no longer be candidates
for OOM first? or vice-versa?

> And the same applies to the other issues. It just helps the system select
> the process to be killed according to userspace's expectation of what
> should be killed to free the memory.
>
>> We have container interfaces that can do this and far more and do them
>> right. In fact the very start of all the OpenVZ and container work years
>> ago was the beancounter patches which were addressed at exactly this
>> problem (although more specifically 'making sure undergraduates' processes
>> get killed first')
>
> Are the beancounters used to limit the amount of virtual RAM and not the
> physical one? That really does not work to limit, for example, some Java
> machine which will eat all the virtual space, swapping out a different
> node. It works for some (and likely most, I do not argue this) cases and
> has overhead. But we are not talking about how to limit the processes,
> but about what to do when we happen to have an out-of-memory condition.
> And it happens all the time even if you put the processes into a separate
> container, since there are situations (that's why this was started in the
> first place) when you have a huge process which should not be killed and
> a set of either its children or external processes which should be
> checked, and some of them (the administrator would like to specify the
> less important ones) should be killed without much harm to the system.
>
> And the patch I presented allows doing that. It introduces a hint for the
> killer on which processes should be checked first. It works exactly the
> way people work with their systems: they run different applications and
> expect some of them to have higher or lower priority when it comes to the
> oom condition. No one proposes to kill exactly the process we select
> (although that may be a good idea in some cases), but instead to show
> that the oom-killer should check a given group first: the group the
> administrator knows to be potentially harmless.
>

You can replace the lines of kernel code you wrote with a simple
one-line script that Alan sent out.

Balbir

2009-01-13 15:21:20

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 08:30:16PM +0530, Balbir Singh ([email protected]) wrote:
> > It was created because of real needs. Because people need to control the
> > behaviour of the system and they want to control which application will
> > be killed to free the memory. The attached patch is not the best solution,
> > but it works for all the cases I can think of.
> >
>
> Where does this end? Tomorrow you'll add an interface for applications
> that should *not* be killed? What sort of a heuristic is name? I think
> the only name the kernel knows about is "init".

We have an interface to disable oom for the process already :)
But I could agree that it could be a good idea to have an interface
to provide a list of names, or whatever else the user knows and works
with, to select what is to be killed first/last.

> > Let's take your 'name aliasing' claim: if there are several processes
> > with the same name, the system will select the one with the worst score
> > according to its own magical algorithm. So it will not kill a random
> > process just because it happened to have a tricky name.
> >
>
> Having a name in the kernel is like building a hit-list, why can't the
> examples that Alan sent work for you?

Using oom_adj? Because there is no way I can determine which number to
put there. It is not even documented for those who do not read kernel
sources. Even after that: oom_score changes with time, so even if an
oom_adj of 1/2 or 8 is correct right now, it will not be in a few moments.

Having containers is a bit overkill to determine which one to kill,
especially when several sets of processes are created from the same
parent :)

> Names are tricky as well, if someone used a symbolic link to the
> application with a different name, they would no longer be candidates
> for OOM first? or vice-versa?

It is up to the user to decide what he wants to be checked first.
Only the user knows what he runs.

> You can replace the lines of kernel code you wrote with a simple
> one-line script that Alan sent out.

Almost. But I can not if tasks are spawned from the parent process. We
can not change the process to give its forked children a different
adjustment, and can not change it for the process itself, since it
should live and the children should die.

--
Evgeniy Polyakov

2009-01-13 16:36:44

by KOSAKI Motohiro

Subject: Re: Linux killed Kenny, bastard!

Hi

sorry... I also don't like this patch.


> @@ -263,6 +270,15 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> }
> } while_each_thread(g, p);
>
> + /*
> + * We did not find the process with requested string in its name,
> + * so lets search for the usual victim.
> + */
> + if (name && !chosen) {
> + name = NULL;
> + goto again;
> + }
> +
> return chosen;

This patch makes oom handling slower.
Slow bad-process selection can cause yet another out-of-memory condition.

Then your trouble becomes larger.


2009-01-13 18:05:24

by Valdis Klētnieks

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Tue, 13 Jan 2009 18:21:06 +0300, Evgeniy Polyakov said:

> Using oom_adj? Because there is no way I can determine which number to
> put there. It is not even documented for those who do not read kernel
> sources. Even after that: oom_score changes with time, so even if an
> oom_adj of 1/2 or 8 is correct right now, it will not be in a few moments.

In that case, the *real* problem to be fixed is a lack of documentation.
It should be possible to add a blurb somewhere in Documentation/* that
says:

"echo 10000 > oom_adjust" is guaranteed to make this process the first one
up against the wall when the revolution comes (for some value of 10000, of
course).



2009-01-13 19:16:24

by David Rientjes

Subject: Re: Linux killed Kenny, bastard!

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Should this explain why ssh is killed?
>

If you would like to make sshd immune from the oom killer, use

echo -17 > /proc/$(pidof sshd)/oom_adj

just like any other task. This score will be inherited by any task that
it executes, so you'll probably want to readjust your shell's oom_adj
score appropriately in your rc file.

> It is very subtle approach. Consider the case when you have a pool of
> threads/processes which are created and released on demand, there are
> several such pools for different servers and you do know which one
> will very likely being guilty.
>
> Who should adjust the scores for newly created processes? Who should
> check that processes in the first group have negative oom adjustment and
> in the second group a positive value? Who determines when it's time to
> adjust the scores?
>

It is userspace's responsibility to set the policy, the kernel merely
provides the mechanism.

> > With oom_adj scores, you have the ability to specify oom kill preferences
> > within a cpuset or memory controller as well, whereas oom_victim_name is
> > global and very costly when not found in select_bad_process().
> >

You chose not to respond to this, which is a major flaw in your approach.

Your patch makes cpuset and memory controller oom killing much slower
because it requires two iterations through the system tasklist when your
global oom_victim_name task is either not running or in a disjoint cpuset
or memcg.

> It is not really costly, since most of the time we skip an entry and do
> not lock the task and do not calculate its badness value. No one scares
> that 'ps ax' is costly because it has to run through all the processes.
>

Talk to SGI about oom killer tasklist scans for their large systems; it
was a prerequisite for me to provide /proc/sys/vm/oom_kill_allocating_task
to avoid a single scan when I made cpuset-constrained ooms go through
select_bad_process().

2009-01-13 19:36:53

by David Rientjes

Subject: Re: Linux killed Kenny, bastard!

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Don't you notice how many 'who' questions were asked, and only a single
> 'user space' answer? Because it is not an answer, it is a theoretical POV
> which does not really work in practice: it is far too inconvenient and
> error-prone, and it does not work when needed, since because of its
> complexity something will be missed. I've just asked the admins who
> originally requested the 'kill-by-name' feature why they did not work
> with /proc/.../oom_adj, and got a nice answer: we tried, but likely
> something went wrong and it did not work the way we wanted.
>
> There is no way to know that the adjustment is correct, that everything
> was up to date when the oom happened, and that nothing was forgotten;
> practice shows that there are always such problems and the wrong tasks
> are killed.
>
> When you put in a name you do know that it works, since there is only a
> single place to update and no need to bother with ugly tools or changes,
> especially to handle short-lived processes.
>

The goal of the oom killer is to kill a rogue memory hogging task, which
will lead to future memory freeing once the task dies, and allow the
system or container to resume normal operation.

You're not realizing the power of /proc/pid/oom_adj: it allows you to tune
the badness scoring so that YOU, the user, may determine what the
definition of 'rogue' is on a task-by-task basis.

Your patch simply allows users to specify a task by name that will always
be killed first when the oom killer is invoked. That's terribly
insufficient if another task uses an excessive amount of memory that you
didn't expect; a rogue task may be leaking memory and the task you've
identified by name with your patch is repeatedly forked and killed when
the rogue task goes untouched.

With oom_adj scores, you can easily specify at what point each task should
be considered rogue. You can elevate the oom_adj score for those you have
a preference to kill and reduce the oom_adj score for those that you'd
prefer being deferred _unless_ they get sufficiently out of hand.

Your patch presents a shortcut where the entire badness scoring (and,
thus, all oom_adj scores) is ignored if the named task exists. That not
only has synchronization issues, but also can cause the kernel to loop
forever in killing a task by the same name without ever freeing memory for
anything else.

Additionally, your patch completely breaks cpuset oom killing since
candidacy is determined in badness() because a task may have allocated
non-migrated memory elsewhere before being moved to a different cpuset.
Your oom_victim_name task may exist globally, but will always be
identified for oom kill even when the oom exists exclusively in a disjoint
cpuset. That does _not_ lead to future memory freeing that current can
use, and if the parent of the killed task decides to immediately fork
another instance, this cpuset will be completely livelocked.

2009-01-13 19:46:59

by David Rientjes

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Using oom_adj? Because there is no way I can determine which number to
> put there. It is not even documented for those who do not read kernel
> sources. Even after that: oom_score changes with time, so even if an
> oom_adj of 1/2 or 8 is correct right now, it will not be in a few moments.
>

Your oom_adj scores should never need to be changed unless you're tuning
the inherited value of a child; it simply represents your input into when
a specific task should be considered rogue enough to target.

However, patches to improve the documentation of the oom killer, or any
other kernel feature, are always welcome.

> > You can replace the lines of kernel code you wrote with a simple
> > one-line script that Alan sent out.
>
> Almost. But I can not if tasks are spawned from the parent process. We
> can not change the process to give its forked children a different
> adjustment, and can not change it for the process itself, since it
> should live and the children should die.
>

Children are already preferred over the chosen parent task, as I've
explained a few times. When a task is identified for oom kill by the
badness heuristics, the oom killer attempts to kill a child that does not
share the same mm first, which is exactly what you're asking for here. If
the parent shares the mm, it needs to exit as well before memory freeing
may occur.

2009-01-13 21:33:27

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 11:46:14AM -0800, David Rientjes ([email protected]) wrote:
> Children are already preferred over the chosen parent task, as I've
> explained a few times. When a task is identified for oom kill by the
> badness heuristics, the oom killer attempts to kill a child that does not
> share the same mm first, which is exactly what you're asking for here. If
> the parent shares the mm, it needs to exit as well before memory freeing
> may occur.

I did not really investigate why it happened, but the oom'ed machine had
killed the cgi daemons and the parent process itself. And ssh on top of
that. While it should have been enough just to kill the appropriate
daemon. Apparently things are not as shiny as they should be.

--
Evgeniy Polyakov

2009-01-13 21:41:20

by David Rientjes

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> I did not really investigate why it happened, but the oom'ed machine had
> killed the cgi daemons and the parent process itself. And ssh on top of
> that. While it should have been enough just to kill the appropriate
> daemon. Apparently things are not as shiny as they should be.
>

As previously mentioned, you have all the diagnostic tools at your
disposal already:

echo 1 > /proc/sys/vm/oom_dump_tasks

The badness scoring is straight-forward given that information, so you can
diagnose why a specific task was not killed and another was chosen. You
can also use that information to appropriately tune the oom_adj scores to
identify your oom killer target preferences.

2009-01-13 21:46:40

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 11:36:04AM -0800, David Rientjes ([email protected]) wrote:
> The goal of the oom killer is to kill a rogue memory hogging task, which
> will lead to future memory freeing once the task dies, and allow the
> system or container to resume normal operation.
>
> You're not realizing the power of /proc/pid/oom_adj: it allows you to tune
> the badness scoring so that YOU, the user, may determine what the
> definition of 'rogue' is on a task-by-task basis.
>
> Your patch simply allows users to specify a task by name that will always
> be killed first when the oom killer is invoked. That's terribly
> insufficient if another task uses an excessive amount of memory that you
> didn't expect; a rogue task may be leaking memory and the task you've
> identified by name with your patch is repeatedly forked and killed when
> the rogue task goes untouched.

It is up to the user to decide; exactly the same will happen if you tune
oom_adj for the task.

> With oom_adj scores, you can easily specify at what point each task should
> be considered rogue. You can elevate the oom_adj score for those you have
> a preference to kill and reduce the oom_adj score for those that you'd
> prefer being deferred _unless_ they get sufficiently out of hand.

No, you can not. Did you try that? The only sane way to use oom_adj is
to disable the oom killer for the task or make its score very small or
very big; there is really no way to do fine-grained tuning, since the
score changes and userspace does not know the algorithm.

> Your patch presents a shortcut where the entire badness scoring (and,
> thus, all oom_adj scores) is ignored if the named task exists. That not
> only has synchronization issues, but also can cause the kernel to loop
> forever in killing a task by the same name without ever freeing memory for
> anything else.

No, that's not what it does. The patch selects the process with the
highest score among those which have a matching name. Please check it
again.

> Additionally, your patch completely breaks cpuset oom killing since
> candidacy is determined in badness() because a task may have allocated
> non-migrated memory elsewhere before being moved to a different cpuset.
> Your oom_victim_name task may exist globally, but will always be
> identified for oom kill even when the oom exists exclusively in a disjoint
> cpuset. That does _not_ lead to future memory freeing that current can
> use, and if the parent of the killed task decides to immediately fork
> another instance, this cpuset will be completely livelocked.

Please check the patch first. It selects the process according to its
badness, checks the memory group first, and falls back to scanning the
other processes if a process with the given name was not found or the
name is null.
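
The selection order described here can be sketched in userspace roughly
as follows. This is an illustrative model, not the patch itself: struct
task_sketch and select_victim() are hypothetical names, the scores are
taken as already computed, name matching is assumed to be a substring
test on the executable name, and the memory group check is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a task: executable name plus badness score. */
struct task_sketch {
	const char *comm;
	unsigned long points;
};

/*
 * Return the index of the chosen victim, or -1 if there are no tasks.
 * First pass: highest score among tasks whose name contains 'name'.
 * Second pass (only if nothing matched): highest score overall.
 */
static int select_victim(const struct task_sketch *tasks, int n,
			 const char *name)
{
	int chosen = -1;
	int i;

again:
	for (i = 0; i < n; i++) {
		/* While a name is set, skip tasks that do not match it. */
		if (name && !strstr(tasks[i].comm, name))
			continue;
		if (chosen < 0 || tasks[i].points > tasks[chosen].points)
			chosen = i;
	}

	/* No task carried the requested name: rescan for the usual victim. */
	if (name && chosen < 0) {
		name = NULL;
		goto again;
	}
	return chosen;
}
```

Given tasks sshd (50), java (900), Kenny (10) and Kenny2 (30), asking
for "Kenny" picks Kenny2, the highest-scoring match, even though java
scores far higher. Asking for a name nobody carries scans the list
twice and ends up at java, which is the extra pass the objections about
cost refer to.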

The user does not work with some magically calculated scores, he just
starts the processes and knows only their names. The user can specify a
pid, but in the case of short-lived connections that is not possible.
Changing the parent's oom score opens a huge possibility of killing it,
while in the case of some application server (or database) it should
never be killed, and only some of its clients (which work for the users
and not for the calculating backend, for example) have to be killed.

There is no way to implement this with short-lived pids (they are
unknown; if inotify worked with /proc it could be doable, though a
special daemon would be needed), and we can not change the parent's
score as was suggested.

You claim that the existing scheme works, but in practice it does not.
So I created a patch which brings a solution somewhat closer. It is not
perfect, but it works, compared to what was suggested from the
theoretical point of view.

--
Evgeniy Polyakov

2009-01-13 22:01:16

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 11:15:26AM -0800, David Rientjes ([email protected]) wrote:
> > Should this explain why ssh is killed?
>
> If you would like to make sshd immune from the oom killer, use
>
> echo -17 > /proc/$(pidof sshd)/oom_adj
>
> just like any other task. This score will be inherited by any task that
> it executes, so you'll probably want to readjust your shell's oom_adj
> score appropriately in your rc file.

For every process? What about short-lived ones? Again, the parent cannot
be changed, since it has to stay; only its children should be killed
(and in some cases not all, but only those from a special set, like cgi
daemons started by the clients, and not database connections).

> > It is very subtle approach. Consider the case when you have a pool of
> > threads/processes which are created and released on demand, there are
> > several such pools for different servers and you do know which one
> > will very likely being guilty.
> >
> > Who should adjust the scores for newly created processes? Who should
> > check that processes in the first group have negative oom adjustment and
> > in the second group a positive value? Who determines when it's time to
> > adjust the scores?
> >
>
> It is userspace's responsibility to set the policy, the kernel merely
> provides the mechanism.

Which does not work for the specified cases. There is no way to specify
the pid of short-lived processes, and the parent cannot be changed (at
least not for a long enough time).

> > > With oom_adj scores, you have the ability to specify oom kill preferences
> > > within a cpuset or memory controller as well, whereas oom_victim_name is
> > > global and very costly when not found in select_bad_process().
> > >
>
> You chose not to respond to this, which is a major flaw in your approach.
>
> Your patch makes cpuset and memory controller oom killing much slower
> because it requires two iterations through the system tasklist when your
> global oom_victim_name task is either not running or in a disjoint cpuset
> or memcg.

It does exactly the same thing that happens for usual processes; it just
selects the ones with the given name and then calculates badness and so
on. It really has nothing to do with the memory group or cpuset: the
process is selected in the given memory group.

> > It is not really costly, since most of the time we skip an entry and do
> > not lock the task and do not calculate its badness value. No one scares
> > that 'ps ax' is costly because it has to run through all the processes.
> >
>
> Talk to SGI about oom killer tasklist scans for their large systems; it
> was a prerequisite for me to provide /proc/sys/vm/oom_kill_allocating_task
> to avoid a single scan when I made cpuset-constrained ooms go through
> select_bad_process().

If the user specifies the name of the process, he knows what he is doing.
It is always possible to set it to null and avoid the second scan.
More on this: a loop with a check inside plus a loop without it add up
to the same work as one loop with the check and the other processing.
Which means that if I change the patch to select two processes, one with
the given name and another one among the others, it will take exactly the
same time and will not introduce a second loop (modulo loop prefetch
optimisations). For example:

loop {
    if (a)
        do_something1();
}

loop {
    do_something2();
}

is equal to

loop {
    if (a)
        do_something1();
    do_something2();
}

Given the number of checks already in that loop, another one to
compare several (and most of the time just one) letters in the
name does not add overhead. So this does not count.
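
The single-loop variant mentioned above (track one candidate with the
given name and one among all tasks in the same pass) could look roughly
like this in userspace C. It is a sketch under assumptions: struct
scan_task and scan_one_pass() are hypothetical names, and the badness
score is taken as already computed, so the cost of scoring and locking
each task is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy task: name plus an already computed badness score (assumed). */
struct scan_task {
    const char *comm;
    unsigned long points;
};

/*
 * One pass over the list: remember the best task whose name contains
 * `name` and, separately, the best task overall. Prefer the named one.
 * Returns an index into `tasks`, or -1 if the list is empty.
 */
int scan_one_pass(const struct scan_task *tasks, size_t n,
                  const char *name)
{
    int best_named = -1, best_any = -1;
    size_t i;

    assert(tasks || n == 0);
    for (i = 0; i < n; i++) {
        /* The extra check: does this task carry the victim name? */
        if (name && *name && strstr(tasks[i].comm, name) &&
            (best_named < 0 ||
             tasks[i].points > tasks[best_named].points))
            best_named = (int)i;
        /* The usual processing: track the overall worst task. */
        if (best_any < 0 || tasks[i].points > tasks[best_any].points)
            best_any = (int)i;
    }
    return best_named >= 0 ? best_named : best_any;
}
```

This returns the same victim as a filtered scan followed by a fallback
scan, but visits each task once. The trade-off is that every task is
scored on every call; whether that is cheaper in practice depends on how
expensive the per-task scoring is, which this toy does not measure.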

What really counts is the fact that, while not perfect, it is so far a
working solution for the problem I described, and the existing (and
proposed) methods do not work.

--
Evgeniy Polyakov

2009-01-13 22:04:18

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

Hi.

On Wed, Jan 14, 2009 at 01:35:36AM +0900, KOSAKI Motohiro ([email protected]) wrote:
> > + /*
> > + * We did not find the process with requested string in its name,
> > + * so lets search for the usual victim.
> > + */
> > + if (name && !chosen) {
> > + name = NULL;
> > + goto again;
> > + }
> > +
> > return chosen;
>
> this patch makes oom handling slower.
> slow bad process selection cause next another out of memory.
>
> then, your trouble become large.

It really does not. As I described, when a task does not have a matching
name, all checks are skipped.

So effectively it amounts to one additional check in the loop, whose
cost, although non-zero, is really very small, since most of the time
the process does not match and only a single letter of the name has to
be checked.

--
Evgeniy Polyakov

2009-01-13 22:05:52

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 01:39:01PM -0800, David Rientjes ([email protected]) wrote:
> > I really did not investigate why it happened, but the oom'ed machine had
> > killed the cgi daemons and the parent process itself. And ssh to boot.
> > While it should have been enough just to kill the appropriate daemon.
> > Apparently things are not as shiny as they should be.
> >
>
> As previously mentioned, you have all the diagnostic tools at your
> disposal already:
>
> echo 1 > /proc/sys/vm/oom_dump_tasks
>
> The badness scoring is straight-forward given that information, so you can
> diagnose why a specific task was not killed and another was chosen. You
> can also use that information to appropriately tune the oom_adj scores to
> identify your oom killer target preferences.

There is no ssh there, so I cannot do any diagnostics. I first have to
change the oom score for ssh, but that's a different story.

--
Evgeniy Polyakov

2009-01-13 22:50:05

by Theodore Ts'o

Subject: Re: Linux killed Kenny, bastard!

On Wed, Jan 14, 2009 at 12:46:27AM +0300, Evgeniy Polyakov wrote:
> User does not work with the some magically calculated scores, he just
> starts the processes and knows only their names. User can specify pid,
> but in the case of short-living connections it is not possible. Changing
> parent oom score opens a huge possibility to kill it, while in case of
> some application server (or database) it should never be killed, and
> only some of its clients (which work for the users and not for the
> calculating backend for example) have to be killed.

The standard way this gets handled for resource limits is very simple:

1) parent forks the child process
2) in the child process we set up resource limits, adjust oom
3) exec the child's program.

As Alan has already pointed out to you:

(echo XXXX > /proc/self/oom_adj ; exec /usr/bin/program)
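
The same fork, adjust, exec sequence can be sketched in C for a launcher
that is not a shell. This is a minimal sketch, not an established API:
write_oom_adj() and spawn_with_oom_adj() are hypothetical helper names,
and the path is passed in by the caller (for the case above it would be
/proc/self/oom_adj). Lowering the value usually needs privilege, so a
failed write is treated as an error here.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `value` to the file at `path`; 0 on success, -1 on error. */
int write_oom_adj(const char *path, int value)
{
    FILE *f;
    int ok;

    assert(path);
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Fork, adjust the child before exec, then run argv[0].
 * Returns the child's pid to the parent, or -1 if fork() failed.
 * The parent's own setting is never touched.
 */
pid_t spawn_with_oom_adj(const char *adj_path, int value,
                         char *const argv[])
{
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: step 2, set the adjustment for this process only. */
        if (write_oom_adj(adj_path, value) != 0)
            _exit(126);
        /* Child: step 3, replace the image with the real program. */
        execvp(argv[0], argv);
        _exit(127);     /* only reached if exec failed */
    }
    return pid;
}
```

A caller would pass something like "/proc/self/oom_adj" and a value from
the range the running kernel accepts; both are inputs to the sketch, not
recommendations.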

There are two problems; one is whether or not the OOM protection is
inherited, and how one sets OOM protection --- and I think you
will find a huge resistance to using names as a way of expressing
policy.

The second problem is that oom_adj scoring is a heuristic which is
hard for system administrators to understand --- and these are
separable problems. Don't conflate them, and don't use the
fact that a random score echo'ed into /proc/pid/oom_adj is hard to
tune as a justification for using process executable names.

If you want to argue that using containers is too hard, and there ought
to be a simpler tuning parameter where (for the sake of argument) all
processes are given a number from 0 to 10, where 5 is the default, and
higher numbers will be picked unconditionally over lower numbers, and
the existing OOM score is used to distinguish between two processes with
the same OOM protection, that's fine.
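
For the sake of argument, the ordering proposed in the paragraph above
can be written as a comparison function. This is only an illustration of
the idea: struct classed_task, the oom_class field and kill_before() are
hypothetical names, and nothing like them exists in the kernel as far as
this thread shows.

```c
#include <assert.h>

/* Hypothetical per-task data for the 0..10 scheme. */
struct classed_task {
    int oom_class;          /* 0..10, default 5; higher dies first */
    unsigned long points;   /* existing OOM score, tie-break only */
};

/* Nonzero if `a` should be killed in preference to `b`. */
int kill_before(const struct classed_task *a,
                const struct classed_task *b)
{
    assert(a && b);
    /* A higher class always wins, whatever the scores are. */
    if (a->oom_class != b->oom_class)
        return a->oom_class > b->oom_class;
    /* Same class: fall back to the existing score. */
    return a->points > b->points;
}
```

Under this ordering a database in class 2 with a huge score survives a
small cgi child in class 8, and two cgi children in class 8 are ordered
by their existing scores.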

How we set that OOM protection class, whether it is via setrlimit() or
echoing into a magic /proc/pid/oom_protection file, and whether it
inherits across fork and exec calls, are separate questions.

- Ted

2009-01-13 23:02:54

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 05:49:41PM -0500, Theodore Tso ([email protected]) wrote:
> > User does not work with the some magically calculated scores, he just
> > starts the processes and knows only their names. User can specify pid,
> > but in the case of short-living connections it is not possible. Changing
> > parent oom score opens a huge possibility to kill it, while in case of
> > some application server (or database) it should never be killed, and
> > only some of its clients (which work for the users and not for the
> > calculating backend for example) have to be killed.
>
> The standard way this gets handled for resource limits is very simple:
>
> 1) parent forks the child process
> 2) in the child process we set up resource limits, adjust oom
> 3) exec the child's program.
>
> As Alan has already pointed out to you:
>
> (echo XXXX > /proc/self/oom_adj ; exec /usr/bin/program)

Yes, I saw that in the archive, but did not receive it myself, so did
not answer. This works in the simple case above, but if we dig a little
into the case where there are children, the parent has to live and not
all children should be considered equal by the oom-killer, things change
dramatically. And we cannot change the sources. Well, in my particular
case we can, but it is not about a single system :)

> There are two problems; one is whether or not the OOM protection is
> inherited or not, and how one sets OOM protection --- and I think you
> will find a huge resistance to using names as a way of expressing
> policy.

Yup, this whole thread shows this resistance quite well :)

> The second problem is that oom_adj scoring is a hueristic which is
> hard for system administrators to understand --- and these are
> separable problems. Don't try to conflate them, and try using the
> fact that a random score echo'ed into /proc/pid/oom_adj is hard to
> tune as a justification for using process executable names.

I tried, and I do agree that it can be used to turn the oom-killer on
or off, but not for tuning. But even this does not really work in the
case shown, when we cannot change the application and the main goal is
to save the parent and kill only some subset of the short-lived
children. So we cannot really adjust the parent's oom score and get the
same in the children, since this would put the parent and important
children at risk.

> If you want to argue that using containers is too hard, and there out
> to be a simpler tuning parameter where (for the sake of argument) all
> processes are given a number from 0 to 10, where 5 is the default, and
> higher numbers will be picked unconditionally over lower numbers, and
> the existing OOM score is used to distinguish between two process with
> the same OOM protection, that's fine.

> How we set that OOM protection class, whether it is via setrlimit() or
> echoing into a magic /proc/pid/oom_protection file, and whether it
> inherits across fork and exec calls, are a separate question.

Let's put containers out of the picture. While they may or may not work,
they are definitely not an issue in the given systems. Having simpler
tunables would be great, but we cannot change them, since it is an
already existing ABI; the documentation could be extended though. I can
cook up a patch tomorrow if no one else does.

--
Evgeniy Polyakov

2009-01-13 23:14:05

by David Rientjes

Subject: Re: Linux killed Kenny, bastard!

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> > Your patch simply allows users to specify a task by name that will always
> > be killed first when the oom killer is invoked. That's terribly
> > insufficient if another task uses an excessive amount of memory that you
> > didn't expect; a rogue task may be leaking memory and the task you've
> > identified by name with your patch is repeatedly forked and killed when
> > the rogue task goes untouched.
>
> It is up to user to decide, exactly the same will happen if you tune the
> oom_adj for the task.
>

It is up to the user to decide, using oom_adj scores as influence, how to
define when a task should be selected by the oom killer. Your
name-parsing hack can do that for a single global task, but oom_adj scores
are actually much more powerful.

> No, you can not. Did you try that? The only sane way to use oom_adj is
> to disable oom killer for the task or make its score very small or very
> big, there is really no way to make a finegrained tuning, since score
> changes and userspace does not know the algorithm.
>

We finely tune oom_adj scores so that we get the desired results, yes.
What you're complaining about here is purely a documentation issue.

> > Additionally, your patch completely breaks cpuset oom killing since
> > candidacy is determined in badness() because a task may have allocated
> > non-migrated memory elsewhere before being moved to a different cpuset.
> > Your oom_victim_name task may exist globally, but will always be
> > identified for oom kill even when the oom exists exclusively in a disjoint
> > cpuset. That does _not_ lead to future memory freeing that current can
> > use, and if the parent of the killed task decides to immediately fork
> > another instance, this cpuset will be completely livelocked.
>
> Please check the patch first. It selects process according to the
> badness, check memory group first and fallbacks to scan other processes
> if process with the given name was not found or name is null.
>

Again, your patch _completely_ breaks cpuset oom killing. That is a
completely separate issue than the memory controller, and it's
disappointing you still don't see it.

In a cpuset constrained oom condition, we do not explicitly exclude all
tasks that are in a disjoint, exclusive cpuset since it's quite possible
that a task has allocated memory outside its cpuset (either because its
cpuset assignment has changed or because its cpuset's mems has changed)
and killing it would free memory in current's cpuset. We do, however,
prefer to kill a task within the same cpuset; that preference is
implemented in the badness() scoring.

If a task exists on the system in a disjoint, exclusive cpuset that
matches oom_victim_name, your patch will cause it to be killed even though
badness() has penalized it for not sharing a cpuset (dividing its score by
eight). That probably needlessly killed oom_victim_name since it won't
allow for future memory freeing in the oom-triggering cpuset and the
original oom condition persists.

Now if the parent of that task or another system task forks
oom_victim_name again, the same thing will happen on the next iteration of
the oom killer. This will not free any memory in current's cpuset and it
will effectively be livelocked.
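
The scenario described above can be illustrated with a toy model. It is
a sketch under assumptions: struct cs_task, cs_points() and cs_select()
are hypothetical names, the only part taken from this message is the
divide-by-eight penalty for a task that does not share current's cpuset,
and the name filter is a simplified stand-in for the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy task with a flag for cpuset overlap with the oom'ing task. */
struct cs_task {
    const char *comm;
    unsigned long raw_points;   /* score before the cpuset penalty */
    int shares_cpuset;          /* nonzero if it overlaps current's */
};

/* Score after the penalty for a disjoint cpuset (divide by eight). */
unsigned long cs_points(const struct cs_task *t)
{
    return t->shares_cpuset ? t->raw_points : t->raw_points / 8;
}

/*
 * Pick the highest penalized score. If `name` is non-NULL, only tasks
 * whose name contains it are considered at all. Returns -1 if nothing
 * qualifies.
 */
int cs_select(const struct cs_task *tasks, size_t n, const char *name)
{
    int chosen = -1;
    size_t i;

    assert(tasks || n == 0);
    for (i = 0; i < n; i++) {
        if (name && !strstr(tasks[i].comm, name))
            continue;
        if (chosen < 0 ||
            cs_points(&tasks[i]) > cs_points(&tasks[chosen]))
            chosen = (int)i;
    }
    return chosen;
}
```

With a task named Kenny in a disjoint cpuset (raw 800, penalized to 100)
and a leaking task in current's cpuset (400), selecting by score alone
picks the leaker, while selecting by the name "Kenny" picks Kenny despite
the penalty. If Kenny is then forked again, the same choice repeats.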

2009-01-13 23:27:13

by Valdis Klētnieks

Subject: Re: Linux killed Kenny, bastard!

On Tue, 13 Jan 2009 14:54:08 +0300, Evgeniy Polyakov said:

> Who should adjust the scores for newly created processes? Who should
> check that processes in the first group have negative oom adjustment and
> in the second group a positive value? Who determines when it's time to
> adjust the scores?

Are you saying that you, as the box's administrator, don't know the answers
to those questions?

Or are you asking what the actual method of implementation is - which process
is supposed to write to (possibly some other process's) oom_adj, and when?



2009-01-13 23:35:37

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 03:10:50PM -0800, David Rientjes ([email protected]) wrote:
> On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:
>
> > > Your patch simply allows users to specify a task by name that will always
> > > be killed first when the oom killer is invoked. That's terribly
> > > insufficient if another task uses an excessive amount of memory that you
> > > didn't expect; a rogue task may be leaking memory and the task you've
> > > identified by name with your patch is repeatedly forked and killed when
> > > the rogue task goes untouched.
> >
> > It is up to user to decide, exactly the same will happen if you tune the
> > oom_adj for the task.
> >
>
> It is up to the user to decide, using oom_adj scores as influence, how to
> define when a task should be selected by the oom killer. Your
> name-parsing hack can do that for a single global task, but oom_adj scores
> are actually much more powerful.
>
> > No, you can not. Did you try that? The only sane way to use oom_adj is
> > to disable oom killer for the task or make its score very small or very
> > big, there is really no way to make a finegrained tuning, since score
> > changes and userspace does not know the algorithm.
> >
>
> We finely tune oom_adj scores so that we get the desired results, yes.
> What you're complaining about here is purely a documentation issue.

Which does not work. Even leaving aside the documentation issue, which
really means that no one really tried to work with it :)

> > > Additionally, your patch completely breaks cpuset oom killing since
> > > candidacy is determined in badness() because a task may have allocated
> > > non-migrated memory elsewhere before being moved to a different cpuset.
> > > Your oom_victim_name task may exist globally, but will always be
> > > identified for oom kill even when the oom exists exclusively in a disjoint
> > > cpuset. That does _not_ lead to future memory freeing that current can
> > > use, and if the parent of the killed task decides to immediately fork
> > > another instance, this cpuset will be completely livelocked.
> >
> > Please check the patch first. It selects process according to the
> > badness, check memory group first and fallbacks to scan other processes
> > if process with the given name was not found or name is null.
> >
>
> Again, your patch _completely_ breaks cpuset oom killing. That is a
> completely separate issue than the memory controller, and it's
> disappointing you still don't see it.
>
> In a cpuset constrained oom condition, we do not explicitly exclude all
> tasks that are in a disjoint, exclusive cpuset since it's quite possible
> that a task has allocated memory outside its cpuset (either because its
> cpuset assignment has changed or because its cpuset's mems has changed)
> and killing it would free memory in current's cpuset. We do, however,
> prefer to kill a task within the same cpuset; that preference is
> implemented in the badness() scoring.
>
> If a task exists on the system in a disjoint, exclusive cpuset that
> matches oom_victim_name, your patch will cause it to be killed even though
> badness() has penalized it for not sharing a cpuset (dividing its score by
> eight). That probably needlessly killed oom_victim_name since it won't
> allow for future memory freeing in the oom-triggering cpuset and the
> original oom condition persists.

It is exactly the purpose of the patch: to kill what is requested to be
killed.

I wonder how you expect users to guess via libastral that even an
adjusted score does not work, because the task happens to be so special
that it cannot be killed :)

> Now if the parent of that task or another system task forks
> oom_victim_name again, the same thing will happen on the next iteration of
> the oom killer. This will not free any memory in current's cpuset and it
> will effectively be livelocked.

My knowledge of cpusets is somewhere between zero and void; moreover,
I opened mm/oom_kill.c for the first time when I created the patch (the
oom-killer is not that interesting actually, but it is a matter of taste
of course).

I can create exactly the reverse situation, where a task is supposed to
be killed, but because of the cpuset/group/whatever else you pointed out
above, its score will be decreased and the rogue task will continue to
live :) This game can be played by both sides, but let's leave that for
others.

The purpose of the patch is to create the ability to kill what the user
needs killed. Exactly what the user wants. The user can kill the system
in millions of ways, and we allow him to do so, but we do not really
allow him to choose what to kill when the system runs into an oom
condition. With my patch it is possible. And it is exactly what is
expected by the user who does not know anything except the names of the
applications he starts.
And please let's not start again with oom_adj if the arguments are still
the same: it does not work in the case I showed.

Peace? :)

Please provide a way to fix the problem I described without my patch,
and everything will be immediately resolved :)

--
Evgeniy Polyakov

2009-01-13 23:37:34

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 06:26:05PM -0500, [email protected] ([email protected]) wrote:
> On Tue, 13 Jan 2009 14:54:08 +0300, Evgeniy Polyakov said:
>
> > Who should adjust the scores for newly created processes? Who should
> > check that processes in the first group have negative oom adjustment and
> > in the second group a positive value? Who determines when it's time to
> > adjust the scores?
>
> Are you saying that you, as the box's administrator, don't know the answers
> to those questions?
>
> Or are you asking what the actual method of implementation is - which process
> is supposed to write to (possibly some other process) oom_adjust, and when?

Unfortunately I do not know the answers, since all the parameters are
highly dynamic and cannot always be tuned automatically. And we do not
have a daemon to watch created processes. And in the general case we
cannot change the application.

--
Evgeniy Polyakov

2009-01-13 23:44:30

by David Rientjes

Subject: Re: Linux killed Kenny, bastard!

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> Which does not work. Even besides documentation issue, which really means
> that no one really tried to work with it :)
>

Please. A lack of thorough documentation, while it should be fixed, does
not imply that a feature is not being used.

> > Again, your patch _completely_ breaks cpuset oom killing. That is a
> > completely separate issue than the memory controller, and it's
> > disappointing you still don't see it.
> >
> > In a cpuset constrained oom condition, we do not explicitly exclude all
> > tasks that are in a disjoint, exclusive cpuset since it's quite possible
> > that a task has allocated memory outside its cpuset (either because its
> > cpuset assignment has changed or because its cpuset's mems has changed)
> > and killing it would free memory in current's cpuset. We do, however,
> > prefer to kill a task within the same cpuset; that preference is
> > implemented in the badness() scoring.
> >
> > If a task exists on the system in a disjoint, exclusive cpuset that
> > matches oom_victim_name, your patch will cause it to be killed even though
> > badness() has penalized it for not sharing a cpuset (dividing its score by
> > eight). That probably needlessly killed oom_victim_name since it won't
> > allow for future memory freeing in the oom-triggering cpuset and the
> > original oom condition persists.
>
> It is exactly the purpose of the patch: to kill what is requested to be
> killed.
>

There are global system-wide oom conditions, cpuset-constrained oom
conditions, memory controller oom conditions, and mempolicy oom
conditions. Your patch affects them all, yet it is quite possible that
killing oom_victim_name will not alleviate the oom condition in a disjoint
cpuset. It would have been needlessly killed because you make no
distinction on the constraint of the oom.

> I wonder how do you expect users to guess via libastral that even
> adjusted score does not work, since it happens that task is so special,
> that it can not be killed :)
>
> My knowledge about cpusets is somewhat between zero and void, even more
> I opened mm/kill.c the first time when created a patch (oom-killer is
> not that interesting actually, but it is a matter of taste of course).
>

Being ignorant about cpusets doesn't justify you breaking their oom
handling.

2009-01-13 23:55:41

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 03:43:56PM -0800, David Rientjes ([email protected]) wrote:
> > Which does not work. Even besides documentation issue, which really means
> > that no one really tried to work with it :)
>
> Please. A lack of thorough documentation, while it should be fixed, does
> not imply that a feature is not being used.

Out of curiosity, how can a feature be used if no one except hardcore
kernel hackers knows how to work with it? I do not mean to insult, no,
I'm really curious. This may explain why the admins I worked with on
this issue did not fully succeed with the tuning.

> > It is exactly the purpose of the patch: to kill what is requested to be
> > killed.
> >
>
> There are global system-wide oom conditions, cpuset-constrained oom
> conditions, memory controller oom conditions, and mempolicy oom
> conditions. Your patch affects them all, yet it is quite possible that
> killing oom_victim_name will not alleviate the oom condition in a disjoint
> cpuset. It would have been needlessly killed because you make no
> distinction on the constraint of the oom.

It is still possible to start a fork-bomb and kill the machine in some
cases, but we allow this. And we also allow limiting the number of
processes started by the user.

This is the same: we have several ways to solve the oom-killer problem.
Some of them work in some cases, some in others. The proposed patch is
another way to deal with the problem. And in some cases it may be wrong.
But if the user specified that behaviour, he knows what he is doing.
Especially when there is no way to properly implement the solution using
the existing methods.

> > I wonder how do you expect users to guess via libastral that even
> > adjusted score does not work, since it happens that task is so special,
> > that it can not be killed :)
> >
> > My knowledge about cpusets is somewhat between zero and void, even more
> > I opened mm/kill.c the first time when created a patch (oom-killer is
> > not that interesting actually, but it is a matter of taste of course).
> >
>
> Being ignorant about cpusets doesn't justify you breaking their oom
> handling.

I did not break cpuset oom-handling; I provided a way to implement it
differently to solve the problem. Yes, this may have side effects; if
people care, they will not use the feature and will leave the victim
name as NULL (although allowing Kenny to live breaks the absolute
fundamentals). Those people who do need this functionality will work
with it.

--
Evgeniy Polyakov

2009-01-14 00:25:42

by Bill Davidsen

Subject: Re: Linux killed Kenny, bastard!

Evgeniy Polyakov wrote:
> So effectively oom_adj only works as enable/disable switch, and since no
> one knows how to tune it, it is better to do not touch at all. And get
> ssh killed. I believe if it is ever used then only to disable oom at
> all, which is wrong, since task still may be killed but after some
> others. My patch adds a simple priority for that based on the name of
> the process, which are known to the administrators who maintain given
> system.
>
>
If nothing else, this would seem to reduce the number of processes for
which the OOM coefficient of evil must be calculated.

--
Bill Davidsen <[email protected]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark

2009-01-14 00:33:17

by David Rientjes

Subject: Re: Linux killed Kenny, bastard!

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> Out of curiosity, how feature can be used, if no one except hardcore
> kernel hackers know how to work with it? I do not insult, no, I'm really
> curious. This may explain, why admins I worked with about this issue did
> not fully succeeded with tuning.
>

You read the code.

It's always great to improve the documentation of a kernel feature, and I
agree that it certainly applies in this case.

I think you could also improve how the badness() scoring is implemented to
make it easier to predict from userspace. I doubt you would find much
opposition to improving the heuristic; we cannot, however, change
/proc/pid/oom_adj since it already has users who depend on it.

> > Being ignorant about cpusets doesn't justify you breaking their oom
> > handling.
>
> I did not break cpuset oom-handling, I provided a way to implement it
> differently to solve the problem. Yes, this may have side effects, if
> people care, they will not use the feature and leave victim name as NULL
> (although allowing Kenny to live breaks the absolute fundamentals).
> Those people who do need this functionality will work with it.
>

You're treating all oom constraints as if they were the same; in a
cpuset-constrained oom, which can be much more common than system-wide
unconstrained ooms, we want to target a task that will allow for future
memory freeing in that cpuset.

So in these cases, to avoid needlessly killing your victim, you would be
forced to set oom_victim_name to NULL. That's hardly useful if the same
problem you're trying to fix still exists both globally and within a
cpuset. Your patch doesn't address this use case, so it's already
incomplete.

In a mempolicy-constrained oom as the result of MPOL_BIND, which can also
be much more common than system-wide unconstrained ooms, we want to target
current because it has allocations from the bound nodes. Your patch
doesn't touch this path, so it's already inconsistent.

I'm comfortable that this patch will not be merged, so I'll silently point
to past posts for the duration of this thread. I definitely think the
documentation can be improved and I don't think you'll have any opposition
to sane heuristic changes that also rely on userspace input via
/proc/pid/oom_adj. Thank you for working on this!

2009-01-14 00:35:58

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 07:24:35PM -0500, Bill Davidsen ([email protected]) wrote:
> >So effectively oom_adj only works as enable/disable switch, and since no
> >one knows how to tune it, it is better to do not touch at all. And get
> >ssh killed. I believe if it is ever used then only to disable oom at
> >all, which is wrong, since task still may be killed but after some
> >others. My patch adds a simple priority for that based on the name of
> >the process, which are known to the administrators who maintain given
> >system.
> >
> If nothing else, this would seem to reduce the number of processes for
> which the OOM coefficient of evil must be calculated.

Yes, it allows this.

--
Evgeniy Polyakov

2009-01-14 00:53:57

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 04:32:41PM -0800, David Rientjes ([email protected]) wrote:
> > Out of curiosity, how feature can be used, if no one except hardcore
> > kernel hackers know how to work with it? I do not insult, no, I'm really
> > curious. This may explain, why admins I worked with about this issue did
> > not fully succeeded with tuning.
>
> You read the code.

Not the best solution actually and at least not the simplest :)

> You're treating each oom constraint like they are on the same; in a
> cpuset-constrained oom, which can be much more common than system-wide
> unconstrained ooms, we want to target a task that will allow for future
> memory freeing in that cpuset.

I do not break the way oom problem is addressed currently. I just
extend it from the different angle, which can not be resolved in some
cases. Those who do not need the feature, can safely disable the check
and rely on the old algorithms.

What you are talking here is a different problem. Completely different
case. We may have the problem case you described, and we will think on
how to resolve the issue, but approach used in the patch does not
enforce the new policy, it extends it adding new global tunable.

> So in these cases, to avoid needlessly killing your victim, you would be
> forced to set oom_victim_name to NULL. That's hardly useful if the same
> problem you're trying to fix still exists both globally and within a
> cpuset. Your patch doesn't address this use case, so it's already
> incomplete.

Incorrect point of view. If administrator wants to select victim task by
name, this patch allows this. If he does not want this, it is turned
off. There are perfectly split areas where each approach applies: global
name-based selection and more specific to the areas where problem
arises.

> In a mempolicy-constrained oom as the result of MPOL_BIND, which can also
> be much more common than system-wide unconstrained ooms, we want to target
> current because it has allocations from the bound nodes. Your patch
> doesn't touch this path, so it's already inconsistent.

And again wrong conclusion: patch is intended to work in the area it was
created for. It is the simplest (and the only btw) solution for the showed
problem. In the systems where it is not needed, it will not be used and
old algorithms will work fine, apparently since no one proposed it
before, other areas work ok without it. While the problem (quite common
actually) I showed was not addressed at all and all proposed solutions
just failed if we start checking requirements more precisely.

> I'm comfortable that this patch will not be merged, so I'll silently point
> to past posts for the duration of this thread. I definitely think the
> documentation can be improved and I don't think you'll have any opposition
> to sane heuristic changes that also rely on userspace input via
> /proc/pid/oom_adj. Thank you for working on this!

So you agree that no existing solution can solve the oom problem when
parent task has to stay and children have to be differentiated. And
agree that not merging the solution for this commonly happened problem
is the right way. And while we can reread the whole thread multiple
times, we will find again and again that proposed approaches do not
work. This patch does its simple task without breaking others and in the
systems where this feature is not needed, it can be safely turned off,
while still fixing the problem for those who care.

I will update documentation tomorrow if there will be no patches.

--
Evgeniy Polyakov

2009-01-14 01:11:55

by Theodore Ts'o

Subject: Re: Linux killed Kenny, bastard!

On Wed, Jan 14, 2009 at 02:02:40AM +0300, Evgeniy Polyakov wrote:
> > As Alan has already pointed out to you:
> >
> > (echo XXXX > /proc/self/oom_adj ; exec /usr/bin/program)
>
> Yes, I saw that in archive, but did not receive myself, so did not
> answer. This works in the above simple case, but if we dig a little bit
> into the case when there are children, parent has to live and not all
> children should be considered equal by the oom-killer, things change
> dramatically. And we can not change the sources. Well, in my particular
> case we can, but it is not about the single system :)

I think you will find that most people are far more interested in
making sure we define consistent, usable interfaces --- and depending
on process names is a complete and total hack. Justifying it by
claiming that we won't be able to change application source code, so
we have to use a hack, isn't going to get you very far.

The security implications alone are troubling; OK, so we make the
process name "sshd" privileged and exempt from the OOM killer. What
happens if a user creates a program called sshd in their home
directory and executes it --- gee, it's protected from the OOM killer
as well. It's just not going to fly. Give up now.

If your argument is "we have to protect crappy closed source
applications where their programmers can't be bothered to change their
source code to use a proper interface", you're just going to get
laughed out of the room.

- Ted

2009-01-14 01:20:19

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 08:11:38PM -0500, Theodore Tso ([email protected]) wrote:
> I think you will find that most people are far more interested in
> making sure we define consistent, usable interfaces --- and depending
> on process names is a complete and total hack. Justifying it by
> claiming that we won't be able to change application source code, so
> we have to use a hack, isn't going to get you very far.

It is not about the possibility to change the sources, but the way
interface is exported to the userspace. Right now it is not usable for
some cases. And forcing applications, which are actually cross-platform,
depending on the way linux controls its own oom-killer is noticeably more
hackish than selecting a system-wide process by its name.

> The security implications alone are troubling; OK, so we make the
> process name "sshd" privileged and exempt from the OOM killer. What
> happens if a user creates a program called sshd in their home
> directory and executes it --- gee, it's protected from the OOM killer
> as well. It's just not going to fly. Give up now.

It is not about who is protected, but who will be selected to be killed.
If you have a rogue application which happened to have the right name,
everything is ok, otherwise it should be tuned further. And even in that
case nothing harmful will happen, since other processes will be
killed first (since the admin selected the name on purpose to kill
potentially damaging applications).

> If your argument is "we have to protect crappy closed source
> applications where their programmers can't be bothered to change their
> source code to use a proper interface", you're just going to get
> laughed out of the room.

You believe that changing apache to control oom_adj is the right way to
deal with linux oom-killer? Do we already flight to the moon?

--
Evgeniy Polyakov

2009-01-14 04:07:06

by Theodore Ts'o

Subject: Re: Linux killed Kenny, bastard!

On Wed, Jan 14, 2009 at 04:20:01AM +0300, Evgeniy Polyakov wrote:
>
> It is not about the possibility to change the sources, but the way
> interface is exported to the userspace. Right now it is not usable for
> some cases. And forcing applications, which are actually cross-platform,
> depending on the way linux controls its own oom-killer is noticeably more
> hackish than selecting a system-wide process by its name.

And we can change that interface if it's not the right one, or perhaps
extend it. After all, you are proposing extending that interface;
just in a really horrible, hackish way.

> You believe that changing apache to control oom_adj is the right way to
> deal with linux oom-killer? Do we already flight to the moon?

Actually, I would believe the right answer is adding a new resource
limit which can be set using the standard getrlimit()/setrlimit()
interface, and then have apache use this standard interface as a way
of configuring itself --- much like how a process can change other
resource limits, such as the number of file descriptors it has open,
etc. And if what you want to do is simply make the process volunteer
to be one of the first processes shot by the OOM killer, the apache
process wouldn't even need setuid privileges to lower the "OOM
protection" resource limit.

I think this is cleaner than echoing a magic value to
/proc/self/oom_adj, and it won't be the first time various open source
programs have been changed to take advantage of Linux-specific
interfaces, especially if there's no other standard way of doing
things --- and an extension of BSD's getrlimit()/setrlimit() is
natural and makes a lot of sense. Heck, apache has been changed to
take advantage of Linux's epoll interface....

- Ted
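
The getrlimit()/setrlimit() pattern Ted refers to can be sketched as
follows. This is only an illustration of the existing interface: it uses
RLIMIT_NOFILE, since the "OOM protection" resource limit is a proposal in
this mail and not something that exists. The helper name
lower_soft_nofile() is made up for the example. A process may lower its
own soft limit without any special privileges.

```c
#include <sys/resource.h>

/*
 * Lower the soft limit on open file descriptors to at most new_soft.
 * The hard limit is left untouched, so no privileges are required.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 *
 * NOTE: RLIMIT_NOFILE stands in for the hypothetical OOM limit
 * discussed in the thread; no such RLIMIT_* constant is assumed here.
 */
int lower_soft_nofile(rlim_t new_soft)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;

	/* Only ever lower the soft limit, never raise it. */
	if (new_soft < rl.rlim_cur)
		rl.rlim_cur = new_soft;

	return setrlimit(RLIMIT_NOFILE, &rl);
}
```

A server would call such a helper once during startup, before it starts
accepting work; an OOM-related limit, if it were ever added, could
presumably be configured the same way.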

2009-01-14 04:24:38

by Valdis Klētnieks

Subject: Re: Linux killed Kenny, bastard!

On Wed, 14 Jan 2009 02:35:02 +0300, Evgeniy Polyakov said:

> It is exactly the purpose of the patch: to kill what is requested to be
> killed.
>
> I wonder how do you expect users to guess via libastral that even
> adjusted score does not work, since it happens that task is so special,
> that it can not be killed :)

What does your patch do if one user has a process 'foo' that they're willing
to have die first, and a process 'bar' that absolutely can't be killed..

Meanwhile, another user has a 'must die first' process 'bar', and a 'must not
die' process 'foo'.

Methinks your patch needs libastral as well?



2009-01-14 09:08:11

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Tue, Jan 13, 2009 at 11:23:57PM -0500, [email protected] ([email protected]) wrote:
> > I wonder how do you expect users to guess via libastral that even
> > adjusted score does not work, since it happens that task is so special,
> > that it can not be killed :)
>
> What does your patch do if one user has a process 'foo' that they're willing
> to have die first, and a process 'bar' that absolutely can't be killed..
>
> Meanwhile, another user has a 'must die first' process 'bar', and a 'must not
> die' process 'foo'.
>
> Methinks your patch needs libastral as well?

My patch needs an administrator to setup the name pattern.
Undocumented magic deeply hidden in the calculus of the badness() is quite different.

--
Evgeniy Polyakov

2009-01-14 16:12:38

by Evgeniy Polyakov

Subject: OOM documentation update [was: Linux killed Kenny, bastard!]

Please apply.
While the existing interfaces do not fix the problems, they should at least be documented.

Sign.

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..f7530a1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,6 +2311,30 @@ increase the likelihood of this process being killed by the oom-killer. Valid
values are in the range -16 to +15, plus the special value -17, which disables
oom-killing altogether for this process.

+Process to be killed at out-of-memory situation is selected among all others
+based on its badness score. This value equals to the memory size of the process
+originally and then changed according to its cpu time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is devided by the sqare root of the cpu time and then by
+the double square root of the run time.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+Following heueristics are then applied:
+ * if task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked task does not belong
+ to it, its score is divided by 8
+ * resulted score is multiplied by the two in the power of oom_adj when it is
+ positive, and devided otherwise, i.e.
+ points <<= oom_adj when it is positive and
+ points >>= oom_adj otherwise
+
+Swapped tasks are killed first.
+Task with the biggest number of badness points is selected to be killed.
+Usually children tasks are prefered compared to their parent.
+
2.13 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------



--
Evgeniy Polyakov

2009-01-14 17:06:28

by Evgeniy Polyakov

Subject: [take2] OOM documentation update [was: Linux killed Kenny, bastard!]

Updated version fixes some errors and extends description by adding
swapping and children relation explanation.

Signed.

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..4aa1918 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,6 +2311,33 @@ increase the likelihood of this process being killed by the oom-killer. Valid
values are in the range -16 to +15, plus the special value -17, which disables
oom-killing altogether for this process.

+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+originally and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the cpu time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked task does not belong
+ to it, its score is divided by 8
+ * resulted score is multiplied by the two in the power of oom_adj when it is
+ positive, and divided otherwise, i.e.
+ points <<= oom_adj when it is positive and
+ points >>= oom_adj otherwise
+
+The task with the highest badness score is then killed.
+
2.13 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------



--
Evgeniy Polyakov

2009-01-14 19:19:15

by Bodo Eggert

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

Evgeniy Polyakov <[email protected]> wrote:
> On Mon, Jan 12, 2009 at 03:51:08PM +0000, Alan Cox ([email protected])

>> > Well, Kenny has to die, but if we still decide to change the world, here
>> > is the first step.
>>
>> NAK this entire thing - we have an existing interface that does the job
>> far better.
>
> Mwahaha, I just checked how scores are calculated, so that userspace
> could adjust them. Let's start with beginning:

[snip]

> Do you _REALLY_ think anyone can calculate it yourself and then properly
> calculate adjustment used to properly select oom-killed process?

That's easy: Just let your Kenny process run, and check its score. If it's
too low, increase the adjustment until it's just above the other processes'
score. Using binary search, you're done in five steps.

Then, while you're at it, protect the important programs by setting
their adjustment to -17.
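
The binary search described above can be sketched like this, under some
loud assumptions: the score seen at oom_adj 0 (base) and the score to beat
(rival) are already known, the adjustment acts as a plain shift, and base
is small enough that shifting left by 15 does not overflow. The names
find_min_adj() and adjusted() are made up for the illustration. On a real
system each probe would mean writing /proc/<pid>/oom_adj and re-reading
/proc/<pid>/oom_score.

```c
#define OOM_ADJ_MIN (-16)
#define OOM_ADJ_MAX 15

/* Score after applying an adjustment as a shift (simplified model). */
static unsigned long long adjusted(unsigned long long base, int adj)
{
	return adj > 0 ? base << adj : base >> -adj;
}

/*
 * Find the smallest oom_adj in [OOM_ADJ_MIN, OOM_ADJ_MAX] for which the
 * adjusted score is strictly above rival. Returns OOM_ADJ_MAX + 1 if
 * even the largest adjustment is not enough. The number of probes is
 * stored in *steps when steps is not NULL.
 */
int find_min_adj(unsigned long long base, unsigned long long rival,
		 int *steps)
{
	int lo = OOM_ADJ_MIN;
	int hi = OOM_ADJ_MAX + 1;	/* sentinel: "not found" */
	int n = 0;

	/* adjusted() never decreases as adj grows, so bisection works. */
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		n++;
		if (adjusted(base, mid) > rival)
			hi = mid;	/* mid is enough, try smaller */
		else
			lo = mid + 1;	/* mid is too small */
	}

	if (steps)
		*steps = n;
	return lo;
}
```

With 32 valid adjustment values plus the "not found" outcome, this loop
needs at most six probes, which is in line with the "five steps" estimate.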

2009-01-14 19:22:28

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Wed, Jan 14, 2009 at 08:18:49PM +0100, Bodo Eggert ([email protected]) wrote:
> > Mwahaha, I just checked how scores are calculated, so that userspace
> > could adjust them. Let's start with beginning:
>
> [snip]
>
> > Do you _REALLY_ think anyone can calculate it yourself and then properly
> > calculate adjustment used to properly select oom-killed process?
>
> That's easy: Just let your Kenny process run, and check its score. If it's
> too low, increase the adjustment until it's just above the other processes'
> score. Using binary search, you're done in five steps.
>
> Then, while you're at it, protect the important programs by setting
> their adjustment to -17.

This does not work if processes are short-living and are spawned by the
parent on demand. If processes have different priority in regards to oom
condition, this problem can not be solved with existing interfaces
without changing the application. So effectively there is no solution.

--
Evgeniy Polyakov

2009-01-14 21:36:32

by Randy Dunlap

Subject: Re: [take2] OOM documentation update [was: Linux killed Kenny, bastard!]

On Wed, 14 Jan 2009 20:06:06 +0300 Evgeniy Polyakov wrote:

> Updated version fixes some errors and extends description by adding
> swapping and children relation explanation.
>
> Signed.

??

> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index d105eb4..4aa1918 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -2311,6 +2311,33 @@ increase the likelihood of this process being killed by the oom-killer. Valid
> values are in the range -16 to +15, plus the special value -17, which disables
> oom-killing altogether for this process.
>
> +The process to be killed in an out-of-memory situation is selected among all others
> +based on its badness score. This value equals the original memory size of the process
> +originally and is then updated according to its CPU time (utime + stime) and the

drop "originally" since earlier part of sentence says "original memory size".

> +run time (uptime - start time). The longer it runs the smaller is the score.
> +Badness score is divided by the square root of the cpu time and then by

CPU

> +the double square root of the run time.
> +
> +Swapped out tasks are killed first. Half of each child's memory size is added to
> +the parent's score if they do not share the same memory. Thus forking servers
> +are the prime candidates to be killed. Having only one 'hungry' child will make
> +parent less preferable than the child.
> +
> +/proc/<pid>/oom_score shows process' current badness score.
> +
> +The following heuristics are then applied:
> + * if the task was reniced, its score doubles
> + * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
> + or CAP_SYS_RAWIO) have their score divided by 4
> + * if oom condition happened in one cpuset and checked task does not belong
> + to it, its score is divided by 8
> + * resulted score is multiplied by the two in the power of oom_adj when it is

confusing. Is this: multiplied by two to the power of oom_adj when it is ... ?

> + positive, and divided otherwise, i.e.
> + points <<= oom_adj when it is positive and
> + points >>= oom_adj otherwise
> +
> +The task with the highest badness score is then killed.
> +
> 2.13 /proc/<pid>/oom_score - Display current oom-killer score
> -------------------------------------------------------------


---
~Randy

2009-01-14 21:53:28

by Bryan Donlan

Subject: Re: [take2] OOM documentation update [was: Linux killed Kenny, bastard!]

On Wed, Jan 14, 2009 at 12:06 PM, Evgeniy Polyakov <[email protected]> wrote:

> + * resulted score is multiplied by the two in the power of oom_adj when it is
> + positive, and divided otherwise, i.e.
> + points <<= oom_adj when it is positive and
> + points >>= oom_adj otherwise

Two to the power of a negative number is equivalent to dividing by two
to the power of said exponent's absolute value, making this paragraph
more than a bit confusing - indeed, a literal read would make it
equivalent to multiplying by 2^abs(oom_adj).

I would think that the following would be enough:
* The resulting score is multiplied by two to the power of oom_adj.

Unless we assume admins don't know how exponentiation by a negative
number works :)

2009-01-14 22:10:55

by Evgeniy Polyakov

Subject: Re: [take2] OOM documentation update [was: Linux killed Kenny, bastard!]

On Wed, Jan 14, 2009 at 04:53:16PM -0500, Bryan Donlan ([email protected]) wrote:
> On Wed, Jan 14, 2009 at 12:06 PM, Evgeniy Polyakov <[email protected]> wrote:
>
> > + * resulted score is multiplied by the two in the power of oom_adj when it is
> > + positive, and divided otherwise, i.e.
> > + points <<= oom_adj when it is positive and
> > + points >>= oom_adj otherwise
>
> Two to the power of a negative number is equivalent to dividing by two
> to the power of said exponent's absolute value, making this paragraph
> more than a bit confusing - indeed, a literal read would make it
> equivalent to multiplying by 2^abs(oom_adj).
>
> I would think that the following would be enough:
> * The resulting score is multiplied by two to the power of oom_adj.

Yes, I think it is enough with shift example.

Thanks guys I will update the doc.

--
Evgeniy Polyakov

2009-01-14 22:14:31

by Evgeniy Polyakov

Subject: [take3] OOM documentation update [was: Linux killed Kenny, bastard!]


diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..eed2fbb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,6 +2311,32 @@ increase the likelihood of this process being killed by the oom-killer. Valid
values are in the range -16 to +15, plus the special value -17, which disables
oom-killing altogether for this process.

+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the CPU time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked task does not belong
+ to it, its score is divided by 8
+ * the resulting score is multiplied by two to the power of oom_adj, i.e.
+ points <<= oom_adj when it is positive and
+ points >>= -(oom_adj) otherwise
+
+The task with the highest badness score is then killed.
+
2.13 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------


--
Evgeniy Polyakov

2009-01-15 00:55:51

by David Rientjes

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> This does not work if processes are short-living and are spawned by the
> parent on demand. If processes have different priority in regards to oom
> condition, this problem can not be solved with existing interfaces
> without changing the application. So effectively there is no solution.
>

Wrong, you can change how the application is forked. Either immediately
adjust /proc/$!/oom_adj or use the adjustment inheritance property and
change /proc/$$/oom_adj to the desired value prior to forking. Thanks.
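
For a launcher that is not a shell, the same idea can be sketched in C:
the child writes its own /proc/self/oom_adj between fork() and exec, so
the target program itself is not modified. This is a rough sketch, not a
tested recipe; the helper names write_adj() and spawn_with_oom_adj() are
made up, and it assumes the kernel exposes /proc/self/oom_adj as
described in this thread.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a decimal value to a file. Returns 0 on success, -1 on error. */
int write_adj(const char *path, int value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", value) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/*
 * Fork, set the child's oom_adj, then exec prog with argv.
 * Returns the child's pid in the parent, or -1 if fork() failed.
 * The child exits with 126 if the adjustment could not be written
 * and with 127 if the exec itself failed.
 */
pid_t spawn_with_oom_adj(int adj, const char *prog, char *const argv[])
{
	pid_t pid = fork();

	if (pid == 0) {
		if (write_adj("/proc/self/oom_adj", adj) != 0)
			_exit(126);
		execv(prog, argv);
		_exit(127);
	}
	return pid;
}
```

This only helps when the parent that does the forking can be changed or
wrapped, which is the point under dispute in this part of the thread.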

2009-01-15 01:00:29

by David Rientjes

Subject: Re: [take3] OOM documentation update [was: Linux killed Kenny, bastard!]

On Thu, 15 Jan 2009, Evgeniy Polyakov wrote:

> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index d105eb4..eed2fbb 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -2311,6 +2311,32 @@ increase the likelihood of this process being killed by the oom-killer. Valid
> values are in the range -16 to +15, plus the special value -17, which disables
> oom-killing altogether for this process.
>
> +The process to be killed in an out-of-memory situation is selected among all others
> +based on its badness score. This value equals the original memory size of the process
> +and is then updated according to its CPU time (utime + stime) and the
> +run time (uptime - start time). The longer it runs the smaller is the score.
> +Badness score is divided by the square root of the CPU time and then by
> +the double square root of the run time.
> +
> +Swapped out tasks are killed first. Half of each child's memory size is added to
> +the parent's score if they do not share the same memory. Thus forking servers
> +are the prime candidates to be killed. Having only one 'hungry' child will make
> +parent less preferable than the child.
> +
> +/proc/<pid>/oom_score shows process' current badness score.
> +
> +The following heuristics are then applied:
> + * if the task was reniced, its score doubles
> + * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
> + or CAP_SYS_RAWIO) have their score divided by 4
> + * if oom condition happened in one cpuset and checked task does not belong
> + to it, its score is divided by 8
> + * the resulting score is multiplied by two to the power of oom_adj, i.e.
> + points <<= oom_adj when it is positive and
> + points >>= -(oom_adj) otherwise
> +
> +The task with the highest badness score is then killed.
> +

Not quite, even after a task is selected for oom kill, the oom killer
still prefers to kill one of its children first if any have a different
mm. See oom_kill_process().

You also don't mention the exception of OOM_DISABLE (oom_adj score of -17)
in your formula for how oom_adj impacts the points value. Although it's
already explained earlier, it should be mentioned here since oom_adj is
an int and a right shift of 17 does not guarantee `points' will be 0.
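
The point about OOM_DISABLE can be illustrated with a small sketch. It
models only the oom_adj step of the documented formula; the function names
apply_oom_adj() and naive_shift() are made up, and the real kernel code may
differ in detail.

```c
#define OOM_DISABLE (-17)

/* The shift as the documentation text states it, with no special case. */
unsigned long naive_shift(unsigned long points, int oom_adj)
{
	if (oom_adj > 0)
		return points << oom_adj;
	return points >> -oom_adj;
}

/* The same step with OOM_DISABLE treated as "never select this task". */
unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
	if (oom_adj == OOM_DISABLE)
		return 0;
	return naive_shift(points, oom_adj);
}
```

For example, a task with 1UL << 20 points still has 8 points left after a
plain right shift by 17, while the special case gives 0.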

2009-01-15 08:43:40

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Wed, Jan 14, 2009 at 04:54:09PM -0800, David Rientjes ([email protected]) wrote:
> > This does not work if processes are short-living and are spawned by the
> > parent on demand. If processes have different priority in regards to oom
> > condition, this problem can not be solved with existing interfaces
> > without changing the application. So effectively there is no solution.
> >
>
> Wrong, you can change how the application is forked. Either immediately
> adjust /proc/$!/oom_adj or use the adjustment inheritance property and
> change /proc/$$/oom_adj to the desired value prior to forking. Thanks.

You and Alan so like bash... Applications are not always forked from shell.

I already pointed out multiple times where parent oom_adj changes lead, and
that this does not work in the real world for some common cases. Existing
scheme only works if some daemon (or application itself) explicitly
changes oom_adj, but no daemon exists to monitor /proc and applications
do not change their own and child's oom_adj because it is way too
linuxish to add such hacks to deal with system's oom-killer, which can
not be properly configured otherwise.

--
Evgeniy Polyakov

2009-01-15 08:51:55

by Evgeniy Polyakov

Subject: Re: [take3] OOM documentation update [was: Linux killed Kenny, bastard!]

On Wed, Jan 14, 2009 at 04:58:41PM -0800, David Rientjes ([email protected]) wrote:
> > +
> > +The task with the highest badness score is then killed.
> > +
>
> Not quite, even after a task is selected for oom kill, the oom killer
> still prefers to kill one of its children first if any have a different
> mm. See oom_kill_process().

Ok, if it was not clear from the description.

> You also don't mention the exception of OOM_DISABLE (oom_adj score of -17)
> in your formula for how oom_adj impacts the points value. Although it's
> already explained earlier, it should be mentioned here since oom_adj is
> an int and a right shift of 17 does not guarantee `points' will be 0.

It is written several lines above.

--
Evgeniy Polyakov

2009-01-15 08:57:19

by Evgeniy Polyakov

Subject: [take4] OOM documentation update [was: Linux killed Kenny, bastard!]

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..4902966 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,6 +2311,34 @@ increase the likelihood of this process being killed by the oom-killer. Valid
values are in the range -16 to +15, plus the special value -17, which disables
oom-killing altogether for this process.

+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the CPU time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked task does not belong
+ to it, its score is divided by 8
+ * the resulting score is multiplied by two to the power of oom_adj, i.e.
+ points <<= oom_adj when it is positive and
+ points >>= -(oom_adj) otherwise
+
+The task with the highest badness score is then selected and its children
+are killed, process itself will be killed in an OOM situation when it does
+not have children or some of them disabled oom like described above.
+
2.13 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------



--
Evgeniy Polyakov
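
As a reading aid, the scoring described in the patch text above can be
sketched as follows. It follows the documentation wording only, not the
actual badness() source: struct task_sketch, its fields and the time units
are assumptions made for the example, and details such as the exact
scaling of CPU time and run time are left out.

```c
#include <stddef.h>

#define OOM_DISABLE (-17)

/* Hypothetical, simplified view of a task; not a kernel structure. */
struct task_sketch {
	unsigned long mem_size;		/* memory size of the task */
	unsigned long cpu_time;		/* utime + stime, assumed pre-scaled */
	unsigned long run_time;		/* uptime - start time, pre-scaled */
	int reniced;			/* non-zero if the task was reniced */
	int privileged;			/* superuser or raw I/O capability */
	int outside_cpuset;		/* not in the cpuset that hit oom */
	int oom_adj;			/* -17 or -16..15 */
	const unsigned long *child_mem;	/* children with their own memory */
	size_t nr_children;
};

/* Integer square root; a slow linear loop, fine for small inputs. */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while (r + 1 <= x / (r + 1))
		r++;
	return r;
}

unsigned long badness_sketch(const struct task_sketch *t)
{
	unsigned long points, s;
	size_t i;

	if (t->oom_adj == OOM_DISABLE)
		return 0;		/* oom-killing disabled for this task */

	/* Start from the memory size, plus half of each child's memory. */
	points = t->mem_size;
	for (i = 0; i < t->nr_children; i++)
		points += t->child_mem[i] / 2;

	/* The longer it has run, the smaller the score. */
	s = isqrt(t->cpu_time);
	if (s)
		points /= s;
	s = isqrt(isqrt(t->run_time));
	if (s)
		points /= s;

	if (t->reniced)
		points *= 2;
	if (t->privileged)
		points /= 4;
	if (t->outside_cpuset)
		points /= 8;

	/* Finally multiply by two to the power of oom_adj. */
	if (t->oom_adj > 0)
		points <<= t->oom_adj;
	else
		points >>= -t->oom_adj;

	return points;
}
```

With mem_size 1600, cpu_time 16 and run_time 256 the sketch gives 100
points; an oom_adj of 3 raises that to 800 and an oom_adj of -2 lowers it
to 25.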

2009-01-15 11:15:10

by David Rientjes

Subject: Re: [take4] OOM documentation update [was: Linux killed Kenny, bastard!]

On Thu, 15 Jan 2009, Evgeniy Polyakov wrote:

> Signed-off-by: Evgeniy Polyakov <[email protected]>
>

Acked-by: David Rientjes <[email protected]>

2009-01-15 21:51:20

by Bodo Eggert

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:
> On Wed, Jan 14, 2009 at 08:18:49PM +0100, Bodo Eggert ([email protected]) wrote:

> > > Mwahaha, I just checked how scores are calculated, so that userspace
> > > could adjust them. Let's start with beginning:
> >
> > [snip]
> >
> > > Do you _REALLY_ think anyone can calculate it yourself and then properly
> > > calculate adjustment used to properly select oom-killed process?
> >
> > That's easy: Just let your Kenny process run, and check its score. If it's
> > too low, increase the adjustment until it's just above the other processes'
> > score. Using binary search, you're done in five steps.
> >
> > Then, while you're at it, protect the important programs by setting
> > their adjustment to -17.
>
> This does not work if processes are short-living and are spawned by the
> parent on demand.

They will have the same name, too. Your Kenny-killer will fail, too.

> If processes have different priority in regards to oom
> condition, this problem can not be solved with existing interfaces
> without changing the application. So effectively there is no solution.

ACK, but being a child should count. Maybe the weight for children should be
increased, if it does not do the right thing? Or maybe the children do share
much (most of the) memory, so killing the parent is the right thing if you
want to free some RAM?

--
The complexity of a weapon is inversely proportional to the IQ of the
weapon's operator.

2009-01-15 22:58:13

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Thu, Jan 15, 2009 at 10:50:58PM +0100, Bodo Eggert ([email protected]) wrote:
> > This does not work if processes are short-living and are spawned by the
> > parent on demand.
>
> They will have the same name, too. Your Kenny-killer will fail, too.

It is not always the case, processes start executing different binaries
and change the names, that's at least what I observed in the particular
root case of the discussion.

> > If processes have different priority in regards to oom
> > condition, this problem can not be solved with existing interfaces
> > without changing the application. So effectively there is no solution.
>
> ACK, but being a child should count. Maybe the weight for children should be
> increased, if it does not do the right thing? Or maybe the children do share
> much (most of the) memory, so killing the parent is the right thing if you
> want to free some RAM?

There could be lots of heuristics applied for the different cases, but
without changing the application, they are somewhat limited to
long-living processes only. There are really lots of cases when it does
not stand.

--
Evgeniy Polyakov

2009-01-17 14:13:16

by Bodo Eggert

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Fri, 16 Jan 2009, Evgeniy Polyakov wrote:
> On Thu, Jan 15, 2009 at 10:50:58PM +0100, Bodo Eggert ([email protected]) wrote:

> > > This does not work if processes are short-living and are spawned by the
> > > parent on demand.
> >
> > They will have the same name, too. Your Kenny-killer will fail, too.
>
> It is not always the case, processes start executing different binaries
> and change the names, that's at least what I observed in the particular
> root case of the discussion.

In that case, you can use a wrapper script.
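
One possible reading of the wrapper suggestion, sketched in Python: the wrapper
sets its own oom_adj and then executes the real program, which keeps the
setting. The helper name set_oom_adj, the proc_root parameter and the script
name oom_wrap.py are hypothetical; the sketch assumes the /proc/<pid>/oom_adj
interface discussed in this thread, and lowering the value may require
privileges.

```python
import os
import sys


def set_oom_adj(value: int, proc_root: str = "/proc") -> None:
    """Write an oom_adj value for the calling process (sketch only)."""
    path = os.path.join(proc_root, "self", "oom_adj")
    with open(path, "w") as f:
        f.write(str(value))


def main(argv):
    # argv[1] is the adjustment, argv[2:] is the real command line.
    set_oom_adj(int(argv[1]))
    # Replace the wrapper with the real program; the adjustment is kept.
    os.execvp(argv[2], argv[2:])


# Assumed usage: oom_wrap.py ADJ COMMAND [ARGS...]
if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv)
```

A single wrapper of this kind takes the command as an argument, so it would not
need one script per executable.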

> > > If processes have different priority in regards to oom
> > > condition, this problem can not be solved with existing interfaces
> > > without changing the application. So effectively there is no solution.
> >
> > ACK, but being a child should count. Maybe the weight for children should be
> > increased, if it does not do the right thing? Or maybe the children do share
> > much (most of the) memory, so killing the parent is the right thing if you
> > want to free some RAM?
>
> There could be lots of heuristics applied for the different cases, but
> without changing the application, they are somewhat limited to
> long-living processes only. There are really lots of cases when it does
> not stand.

If it's short-lived enough, the processes will out-die the OOM-Killer.
You can only win by suspending or killing the factory.
--
Why do men die before their wives?
They want to.

2009-01-17 14:22:57

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Sat, Jan 17, 2009 at 03:12:49PM +0100, Bodo Eggert ([email protected]) wrote:
> > > > This does not work if processes are short-living and are spawned by the
> > > > parent on demand.
> > >
> > > They will have the same name, too. Your Kenny-killer will fail, too.
> >
> > It is not always the case, processes start executing different binaries
> > and change the names, that's at least what I observed in the particular
> > root case of the discussion.
>
> In that case, you can use a wrapper script.

That may be a solution, except that it is not very convenient, since there
may be really lots of executables and cooking up a special script for
each one will not scale well.

> > There could be lots of heuristics applied for the different cases, but
> > without changing the application, they are somewhat limited to
> > long-living processes only. There are really lots of cases when it does
> > not stand.
>
> If it's short-lived enough, the processes will out-die the OOM-Killer.
> You can only win by suspending or killing the factory.

No, admin will limit/forbid the connection from the DoSing clients,
server must always live to handle proper users.

--
Evgeniy Polyakov

2009-01-17 15:22:20

by Bodo Eggert

Subject: Re: Linux killed Kenny, bastard!

Evgeniy Polyakov <[email protected]> wrote:

> The only sane way to use oom_adj is
> to disable oom killer for the task or make its score very small or very
> big, there is really no way to make a finegrained tuning, since score
> changes and userspace does not know the algorithm.

It does not need to know, it just has to tune until it's about level with
the other normal processes if it's to be a normal process, and add or
subtract one or two to oom_adj if it's a very (un)important process.
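
A minimal sketch of the tuning idea above, in Python: since each oom_adj step
doubles or halves the score, the smallest adjustment that lifts a process just
above a target score can be found by trying each value. The function name
suggest_oom_adj and the range -16..15 are assumptions for illustration, and the
sketch treats the unadjusted score as fixed, which is exactly what is disputed
in this thread.

```python
# Assumed tunable range; -17 is the "protect" value mentioned in the thread.
OOM_ADJ_MIN, OOM_ADJ_MAX = -16, 15


def suggest_oom_adj(base_score: int, target_score: int) -> int:
    """Smallest oom_adj whose shifted score exceeds target_score.

    base_score is the score the process would have with oom_adj = 0.
    Returns OOM_ADJ_MAX if no value in the range is enough.
    """
    for adj in range(OOM_ADJ_MIN, OOM_ADJ_MAX + 1):
        shifted = base_score << adj if adj > 0 else base_score >> -adj
        if shifted > target_score:
            return adj
    return OOM_ADJ_MAX


if __name__ == "__main__":
    print(suggest_oom_adj(100, 700))
    print(suggest_oom_adj(1000, 200))
```

The result is only as stable as the two scores fed into it; if either one moves
between reads, the suggested value moves with it.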

2009-01-17 15:42:00

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Sat, Jan 17, 2009 at 04:21:50PM +0100, Bodo Eggert ([email protected]) wrote:
> > The only sane way to use oom_adj is
> > to disable oom killer for the task or make its score very small or very
> > big, there is really no way to make a finegrained tuning, since score
> > changes and userspace does not know the algorithm.
>
> It does not need to know, it just has to tune until it's about level with
> the other normal processes if it's to be a normal process, and add or
> subtract one or two to oom_adj if it's a very (un)important process.

Did you try such tuning yourself? I submitted documentation update on
how score is calculated, there is really no way admin can do that in
some script or manually, and relying on oom_score content is not very
precise since it jumps all over the range depending on the read time.

So, forget that you can tune oom_adj system-wide, it is only possible to
effectively enable or disable it. If there are two identical processes,
it is possible in theory to tune them against each other, but practice
and theory are the same only in theory, in practice one of the processes
will suddenly start doing different things and all your calculus
immediately become wrong.

OOM scores can not be reliably tuned system-wide.

--
Evgeniy Polyakov

2009-01-18 12:37:29

by Bodo Eggert

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Sat, 17 Jan 2009, Evgeniy Polyakov wrote:

> On Sat, Jan 17, 2009 at 03:12:49PM +0100, Bodo Eggert ([email protected]) wrote:
> > > > > This does not work if processes are short-living and are spawned by the
> > > > > parent on demand.
> > > >
> > > > They will have the same name, too. Your Kenny-killer will fail, too.
> > >
> > > It is not always the case, processes start executing different binaries
> > > and change the names, that's at least what I observed in the particular
> > > root case of the discussion.
> >
> > In that case, you can use a wrapper script.
>
> That may be a solution, except that it is not very convenient, since there
> may be really lots of executables and cooking up a special script for
> each one will not scale well.

How many different CGI handlers are you going to have?

And how does kill-kenny scale with the number of users on the system?
I want my browser not to be killed, while the other user wants his
gimp not to be killed. As you can see, it does not even scale for
the most simple multi-user system.

> > > There could be lots of heuristics applied for the different cases, but
> > > without changing the application, they are somewhat limited to
> > > long-living processes only. There are really lots of cases when it does
> > > not stand.
> >
> > If it's short-lived enough, the processes will out-die the OOM-Killer.
> > You can only win by suspending or killing the factory.
>
> No, admin will limit/forbid the connection from the DoSing clients,
> server must always live to handle proper users.

If there is no memory, the admin can't even log in.
--
Programming is an art form that fights back.

2009-01-18 12:50:23

by Bodo Eggert

Subject: Re: Linux killed Kenny, bastard!

On Sat, 17 Jan 2009, Evgeniy Polyakov wrote:
> On Sat, Jan 17, 2009 at 04:21:50PM +0100, Bodo Eggert ([email protected]) wrote:

> > > The only sane way to use oom_adj is
> > > to disable oom killer for the task or make its score very small or very
> > > big, there is really no way to make a finegrained tuning, since score
> > > changes and userspace does not know the algorithm.
> >
> > It does not need to know, it just has to tune until it's about level with
> > the other normal processes if it's to be a normal process, and add or
> > subtract one or two to oom_adj if it's a very (un)important process.
>
> Did you try such tuning yourself?

No, I just designed the oom_adj mechanism and left the implementation to
experienced people.

> I submitted documentation update on
> how score is calculated, there is really no way admin can do that in
> some script or manually, and relying on oom_score content is not very
> precise since it jumps all over the range depending on the read time.

The value is expected to vary with the read time, since memory usage
does vary, too. If your sshd goes berserk and starts eating memory,
it may be smart to kill it. If it doesn't, the score should remain
about the same.

> So, forget that you can tune oom_adj system-wide, it is only possible to
> effectively enable or disable it.

It is, as long as the score varies with the badness of a process.

> If there are two identical processes,
> it is possible in theory to tune them against each other, but practice
> and theory are the same only in theory, in practice one of the processes
> will suddenly start doing different things and all your calculus
> immediately become wrong.
> OOM scores can not be reliably tuned system-wide.

As I said, if a program starts misbehaving, it's mostly not sane to keep
it alive. Besides that, the oom_adj allows you to special case some
processes to be immortal or to be kenny, but you should be prepared for
trouble then.
--
A Purple Heart just goes to prove that you were smart enough to think of a
plan, stupid enough to try it, and lucky enough to survive.

2009-01-18 13:13:19

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Sun, Jan 18, 2009 at 01:37:09PM +0100, Bodo Eggert ([email protected]) wrote:
> How many different CGI handlers are you going to have?

CGIs are usually limited, application server is not.

> And how does kill-kenny scale with the number of users on the system?
> I want my browser not to be killed, while the other user wants his
> gimp not to be killed. As you can see, it does not even scale for
> the most simple multi-user system.

It is not about who should not be killed, but who should _be_ in the
first row.

> > No, admin will limit/forbid the connection from the DoSing clients,
> > server must always live to handle proper users.
>
> If there is no memory, the admin can't even log in.

Admin can observe the situation via kvm or sometimes netconsole and
tune the system for the next run.

--
Evgeniy Polyakov

2009-01-18 13:17:23

by Evgeniy Polyakov

Subject: Re: Linux killed Kenny, bastard!

On Sun, Jan 18, 2009 at 01:49:58PM +0100, Bodo Eggert ([email protected]) wrote:
> > OOM scores can not be reliably tuned system-wide.
>
> As I said, if a program starts misbehaving, it's mostly not sane to keep
> it alive. Besides that, the oom_adj allows you to special case some
> processes to be immortal or to be kenny, but you should be prepared for
> trouble then.

oom_adj can turn off the oom-killer for a given PID and make it
essentially immortal, but in practice it cannot do fine-grained tuning.
But even that can only be achieved with lots of problems, that's the
main point of the patch: to make things simple for the people who know
what they are doing.

--
Evgeniy Polyakov

2009-01-18 20:26:18

by Bodo Eggert

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Sun, 18 Jan 2009, Evgeniy Polyakov wrote:
> On Sun, Jan 18, 2009 at 01:37:09PM +0100, Bodo Eggert ([email protected]) wrote:

> > How many different CGI handlers are you going to have?
>
> CGIs are usually limited, application server is not.
>
> > And how does kill-kenny scale with the number of users on the system?
> > I want my browser not to be killed, while the other user wants his
> > gimp not to be killed. As you can see, it does not even scale for
> > the most simple multi-user system.
>
> It is not about who should not be killed, but who should _be_ in the
> first row.

If it comes to the killing, it will start with the first row, or using your
patch, with the only man in the first row, named kenny. Now imagine a
phalanx of spawned kennies protecting a running-wild application from being
killed ...

If you set the oom_adj to mark the goat under normal conditions, the system
will adjust itself to abnormal conditions.

> > > No, admin will limit/forbid the connection from the DoSing clients,
> > > server must always live to handle proper users.
> >
> > If there is no memory, the admin can't even log in.
>
> Admin can observe the situation via kvm or sometimes netconsole and
> tune the system for the next run.

So your kill-kenny does not only require having exactly one goat system-wide
and no process having the same process name, but also constant supervision.
I think it's a really great design!
--
Whenever you have plenty of ammo, you never miss. Whenever you are low on
ammo, you can't hit the broad side of a barn.

2009-01-18 20:41:16

by Evgeniy Polyakov

Subject: Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!

On Sun, Jan 18, 2009 at 09:25:49PM +0100, Bodo Eggert ([email protected]) wrote:
> > It is not about who should not be killed, but who should _be_ in the
> > first row.
>
> If it comes to the killing, it will start with the first row, or using your
> patch, with the only man in the first row, named kenny. Now imagine a
> phalanx of spawned kennies protecting a running-wild application from being
> killed ...
>
> If you set the oom_adj to mark the goat under normal conditions, the system
> will adjust itself to abnormal conditions.

Admin who sets it up knows what he is doing. Hope you will not argue
about the case, when admin will disable the oom-killer and will not be
able to log in.

Once again: this is an additional tunable which allows one to easily solve
the problem shown here multiple times. And while you did not try to
tune oom_adj yourself, you continue arguing that it works the best. It
does not. No solution for the problem shown is simple and
nice-looking; the one I proposed imo looks the most convenient for the
people who really work with the systems where the described behaviour was
observed.

> > > > No, admin will limit/forbid the connection from the DoSing clients,
> > > > server must always live to handle proper users.
> > >
> > > If there is no memory, the admin can't even log in.
> >
> > Admin can observe the situation via kvm or sometimes netconsole and
> > tune the system for the next run.
>
> So your kill-kenny does not only require having exactly one goat system-wide
> and no process having the same process name, but also constant supervision.
> I think it's a really great design!

You should reread (better twice) what we are talking about here and what
and why patch was proposed. And how it works too.

--
Evgeniy Polyakov