2004-09-22 23:34:13

by Thomas Habets

[permalink] [raw]
Subject: [PATCH] oom_pardon, aka don't kill my xlock


Hello.

How about a sysctl that does "for the love of kbaek, don't ever kill these
processes when OOM. If nothing else can be killed, I'd rather you panic"?

Examples for this list would be /usr/bin/vlock and /usr/X11R6/bin/xlock.
I just got a very uncomfortable surprise when found my box unlocked thanks to
this.

After playing around a bit, I made the patch below, but it's almost completely
untested. I'm not even sure I take the binaries name from the right place.
And I don't know if the locking can race. If it's too ugly then it'd be great
if someone implemented it the right way. (iow: huge fucking disclaimer)

echo "/usr/bin/vlock /usr/X11R6/bin/xlock" > /proc/sys/vm/oom_pardon

---------
typedef struct me_s {
char name[] = { "Thomas Habets" };
char email[] = { "[email protected]" };
char kernel[] = { "Linux" };
char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt" };
char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE 0945 286A E90A AD48 E854" };
char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;

diff -Nur linux-2.6.7.orig/CREDITS linux-2.6.7/CREDITS
--- linux-2.6.7.orig/CREDITS 2004-06-16 07:19:43.000000000 +0200
+++ linux-2.6.7/CREDITS 2004-09-23 00:02:44.000000000 +0200
@@ -1210,6 +1210,14 @@
W: http://www.inf.fu-berlin.de/~ehaase
D: Driver for the Commodore A2232 serial board

+N: Thomas Habets
+E: [email protected]
+D: random Linux hacker
+P: 1024D/AD48E854 A8A3 D1DD 4AE0 8467 7FDE 0945 286A E90A AD48 E854
+S: Tunnlandsv?gen 40
+S: 168 36 Bromma
+S: Sweden
+
N: Bruno Haible
E: [email protected]
D: SysV FS, shm swapping, memory management fixes
diff -Nur linux-2.6.7.orig/include/linux/mm.h linux-2.6.7/include/linux/mm.h
--- linux-2.6.7.orig/include/linux/mm.h 2004-06-16 07:18:56.000000000 +0200
+++ linux-2.6.7/include/linux/mm.h 2004-09-23 00:28:53.000000000 +0200
@@ -21,6 +21,8 @@
extern unsigned long max_mapnr;
#endif

+#define VM_OOM_PARDON_LEN 256
+extern char vm_oom_pardon[VM_OOM_PARDON_LEN];
extern unsigned long num_physpages;
extern void * high_memory;
extern int page_cluster;
diff -Nur linux-2.6.7.orig/include/linux/sysctl.h
linux-2.6.7/include/linux/sysctl.h
--- linux-2.6.7.orig/include/linux/sysctl.h 2004-06-16 07:19:35.000000000
+0200
+++ linux-2.6.7/include/linux/sysctl.h 2004-09-23 00:20:37.000000000 +0200
@@ -164,6 +164,7 @@
VM_LAPTOP_MODE=23, /* vm laptop mode */
VM_BLOCK_DUMP=24, /* block dump mode */
VM_HUGETLB_GROUP=25, /* permitted hugetlb group */
+ VM_OOM_PARDON=26, /* don't oom-kill these processes */
};


diff -Nur linux-2.6.7.orig/kernel/sysctl.c linux-2.6.7/kernel/sysctl.c
--- linux-2.6.7.orig/kernel/sysctl.c 2004-06-16 07:18:58.000000000 +0200
+++ linux-2.6.7/kernel/sysctl.c 2004-09-23 00:28:51.000000000 +0200
@@ -795,6 +795,15 @@
.strategy = &sysctl_intvec,
.extra1 = &zero,
},
+ {
+ .ctl_name = VM_OOM_PARDON,
+ .procname = "oom_pardon",
+ .data = &vm_oom_pardon,
+ .maxlen = sizeof(vm_oom_pardon),
+ .mode = 0644,
+ .proc_handler = &proc_doutsstring,
+ .strategy = &sysctl_string,
+ },
{ .ctl_name = 0 }
};

diff -Nur linux-2.6.7.orig/mm/oom_kill.c linux-2.6.7/mm/oom_kill.c
--- linux-2.6.7.orig/mm/oom_kill.c 2004-06-16 07:19:29.000000000 +0200
+++ linux-2.6.7/mm/oom_kill.c 2004-09-23 00:31:12.000000000 +0200
@@ -16,14 +16,56 @@
*/

#include <linux/mm.h>
+#include <linux/utsname.h>
#include <linux/sched.h>
#include <linux/swap.h>
#include <linux/timex.h>
#include <linux/jiffies.h>

+char vm_oom_pardon[VM_OOM_PARDON_LEN];
/* #define DEBUG */

/**
+ * For the love of kbaek, don't kill processes in /proc/sys/vm/oom_pardon
+ */
+static int pardon(struct task_struct *task)
+{
+ static char buf[256];
+ const struct qstr *exe;
+ const char *p;
+ int len;
+
+ exe = &task->proc_dentry->d_name;
+ len = min((int)exe->len, (int)(sizeof(buf) - 2));
+
+ memcpy(buf, exe->name, len);
+ buf[len] = 0;
+ buf[len+1] = 0;
+
+ if (strchr(buf, ' ')) {
+ return 0;
+ }
+
+ down_read(&uts_sem);
+ p = vm_oom_pardon;
+ do {
+ buf[len] = ' ';
+ if (!strncmp(p, buf, len)) {
+ return 1;
+ }
+
+ buf[len] = 0;
+ if (!strcmp(p, buf)) {
+ return 1;
+ }
+ p = strchr(p, ' ');
+ } while(p++);
+ up_read(&uts_sem);
+
+ return 0;
+}
+
+/**
* oom_badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
*
@@ -50,6 +92,10 @@

if (p->flags & PF_MEMDIE)
return 0;
+
+ if (pardon(p))
+ return 0;
+
/*
* The memory size of the process is the basis for the badness.
*/


Attachments:
(No filename) (4.84 kB)
(No filename) (189.00 B)
Download all attachments

2004-09-23 00:01:56

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Thomas Habets wrote:
> Hello.
>
> How about a sysctl that does "for the love of kbaek, don't ever kill these
> processes when OOM. If nothing else can be killed, I'd rather you panic"?
>
> Examples for this list would be /usr/bin/vlock and /usr/X11R6/bin/xlock.
> I just got a very uncomfortable surprise when found my box unlocked thanks to
> this.
>
> After playing around a bit, I made the patch below, but it's almost completely
> untested. I'm not even sure I take the binaries name from the right place.
> And I don't know if the locking can race. If it's too ugly then it'd be great
> if someone implemented it the right way. (iow: huge fucking disclaimer)
>
> echo "/usr/bin/vlock /usr/X11R6/bin/xlock" > /proc/sys/vm/oom_pardon
>

Hi,
Nice idea. It could probably made include-worthy if you just set a flag in the
task struct in question.

Also, use pid numbers instead of names, I think. (Or prctl? What is the
'preferred' way of setting random per-process flags?)

Nick

2004-09-23 00:07:51

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Thu, Sep 23, 2004 at 01:23:08AM +0200, Thomas Habets wrote:
> How about a sysctl that does "for the love of kbaek, don't ever kill these
> processes when OOM. If nothing else can be killed, I'd rather you panic"?
> Examples for this list would be /usr/bin/vlock and /usr/X11R6/bin/xlock.
> I just got a very uncomfortable surprise when found my box unlocked thanks to
> this.
> After playing around a bit, I made the patch below, but it's almost
> completely untested. I'm not even sure I take the binaries name from
> the right place. And I don't know if the locking can race. If it's
> too ugly then it'd be great if someone implemented it the right way.
> (iow: huge fucking disclaimer)
> echo "/usr/bin/vlock /usr/X11R6/bin/xlock" > /proc/sys/vm/oom_pardon

Assuming this is desirable (otherwise, why would you have written it?)

(1) uts_sem isn't the right lock.
(2) You acquire uts_sem under tasklist_lock, a deadlock.
(3) It would probably make more sense to dynamically register and
unregister the various criteria for exempt processes than mess
with space-separated fields of a single string.


-- wli

2004-09-23 04:47:44

by Tonnerre

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Salut,

On Thu, Sep 23, 2004 at 01:23:08AM +0200, Thomas Habets wrote:
> diff -Nur linux-2.6.7.orig/CREDITS linux-2.6.7/CREDITS
> --- linux-2.6.7.orig/CREDITS 2004-06-16 07:19:43.000000000 +0200
> +++ linux-2.6.7/CREDITS 2004-09-23 00:02:44.000000000 +0200
> @@ -1210,6 +1210,14 @@
> W: http://www.inf.fu-berlin.de/~ehaase
> D: Driver for the Commodore A2232 serial board
>
> +N: Thomas Habets
> +E: [email protected]
> +D: random Linux hacker
> +P: 1024D/AD48E854 A8A3 D1DD 4AE0 8467 7FDE 0945 286A E90A AD48 E854
> +S: Tunnlandsv?gen 40
> +S: 168 36 Bromma
> +S: Sweden
> +
> N: Bruno Haible
> E: [email protected]
> D: SysV FS, shm swapping, memory management fixes

That should be D: OOM killer exceptions or whatever, I suppose.

> diff -Nur linux-2.6.7.orig/kernel/sysctl.c linux-2.6.7/kernel/sysctl.c
> --- linux-2.6.7.orig/kernel/sysctl.c 2004-06-16 07:18:58.000000000 +0200
> +++ linux-2.6.7/kernel/sysctl.c 2004-09-23 00:28:51.000000000 +0200
> @@ -795,6 +795,15 @@
> .strategy = &sysctl_intvec,
> .extra1 = &zero,
> },
> + {
> + .ctl_name = VM_OOM_PARDON,
> + .procname = "oom_pardon",
> + .data = &vm_oom_pardon,
> + .maxlen = sizeof(vm_oom_pardon),
> + .mode = 0644,
> + .proc_handler = &proc_doutsstring,
> + .strategy = &sysctl_string,
> + },
> { .ctl_name = 0 }
> };
>

A sysctl is a bad implementation since you can only store one single
string in it.

> diff -Nur linux-2.6.7.orig/mm/oom_kill.c linux-2.6.7/mm/oom_kill.c
> --- linux-2.6.7.orig/mm/oom_kill.c 2004-06-16 07:19:29.000000000 +0200
> +++ linux-2.6.7/mm/oom_kill.c 2004-09-23 00:31:12.000000000 +0200
> @@ -16,14 +16,56 @@
> */
>
> #include <linux/mm.h>
> +#include <linux/utsname.h>
> #include <linux/sched.h>
> #include <linux/swap.h>
> #include <linux/timex.h>
> #include <linux/jiffies.h>
>
> +char vm_oom_pardon[VM_OOM_PARDON_LEN];
> /* #define DEBUG */
>
> /**
> + * For the love of kbaek, don't kill processes in /proc/sys/vm/oom_pardon
> + */
> +static int pardon(struct task_struct *task)
> +{
> + static char buf[256];

That 256 should be VM_OOM_PARDON_LEN ?

> + const struct qstr *exe;
> + const char *p;
> + int len;
> +
> + exe = &task->proc_dentry->d_name;
> + len = min((int)exe->len, (int)(sizeof(buf) - 2));

Dito.

> + memcpy(buf, exe->name, len);
> + buf[len] = 0;
> + buf[len+1] = 0;
> +
> + if (strchr(buf, ' ')) {
> + return 0;
> + }
> +
> + down_read(&uts_sem);

We're under the task lock, and you want us to sleep here? There's a
little problem: we'd want to switch the task, and since the task lock
is taken, we'll wait an infinite amount of time (yes, literally!) for
it to become free.

> + p = vm_oom_pardon;
> + do {
> + buf[len] = ' ';
> + if (!strncmp(p, buf, len)) {
> + return 1;
> + }
> +
> + buf[len] = 0;
> + if (!strcmp(p, buf)) {
> + return 1;
> + }
> + p = strchr(p, ' ');
> + } while(p++);

What about programs with spaces in its names?

Actually, I'd really use a different interface to register and
unregister processes to protect. And maybe not (just) by the binary
name. Make a real filter list, or track them by pid.

> + up_read(&uts_sem);
> +
> + return 0;
> +}
> +
> +/**
> * oom_badness - calculate a numeric value for how bad this task has been
> * @p: task struct of which task we should calculate
> *

Tonnerre


Attachments:
(No filename) (3.67 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2004-09-23 06:59:20

by Thomas Habets

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Once upon a midnight dreary, Tonnerre pondered, weak and weary:
> A sysctl is a bad implementation since you can only store one single
> string in it.

Yup. What would be a good interface for setting that flag per-process?
prctl()?
Personally, I'd prefer it without userspace having to write code for it.

Also, it should be able to protect against a DoS where a user launches N
un-OOM-killable processes.

> > + static char buf[256];
> That 256 should be VM_OOM_PARDON_LEN ?

Nope, it's the binary path len, so PATH_MAX maybe.

> We're under the task lock, and you want us to sleep here?

Oh.. right. Well I don't need that lock if it's just a per-process flag.

> What about programs with spaces in its names?

I thought "screw 'em". :-)

---------
typedef struct me_s {
char name[] = { "Thomas Habets" };
char email[] = { "[email protected]" };
char kernel[] = { "Linux" };
char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt" };
char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE 0945 286A E90A AD48 E854" };
char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


Attachments:
(No filename) (1.08 kB)
(No filename) (189.00 B)
Download all attachments

2004-09-23 12:26:18

by Tonnerre

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Salut,

On Thu, Sep 23, 2004 at 08:57:50AM +0200, Thomas Habets wrote:
> Yup. What would be a good interface for setting that flag per-process?
> prctl()?
> Personally, I'd prefer it without userspace having to write code for it.

Well, either via a new syscall/ioctl, or via some exported file in
/proc or /sys. I guess the second approach (file) will be prefered.

> Also, it should be able to protect against a DoS where a user launches N
> un-OOM-killable processes.

You can still do that. Maybe kill those processes first who got less
criterias matching the OOM gracefulness, so you can protect httpd more
strongly than xlock.

Also remember to set per-user limits of processes. :)

> > What about programs with spaces in its names?
>
> I thought "screw 'em". :-)

Now that's what I call policy!

Tonnerre


Attachments:
(No filename) (828.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2004-09-23 13:44:03

by Thomas Habets

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Once upon a midnight dreary, Tonnerre pondered, weak and weary:
> > Yup. What would be a good interface for setting that flag per-process?
> Well, either via a new syscall/ioctl, or via some exported file in
> /proc or /sys.

In /proc, you mean /proc/<pid>/oom_pardon then?
I didn't see any other settings there, so I thought it might be the wrong
place.

Or should it maybe be a multiline rule file in /proc/sys/net/vm/oom_pardon:
0:exe /usr/bin/vlock
+10:user jerry
+5:user bob:exe /usr/bin/vlock
+100000:user !thomas:exe /usr/bin/emacs

: separating fields and \: escaping it. Maybe skip "exe" and "user" and have
fixed fields.

Then match the whole table for every task on OOM, setting the absolute badness
if there's no +/- and change relatively if there is.
Don't exit on match, so the first two would both apply to jerrys vlock, giving
it a badness of 10, and bobs would get 5.

And probably uid instead of username.

Hmm, or maybe this is overkill? But having it apply to every newly-created
process before a daemon could have the time to apply badness via *ctl() on
every new process would be nice.

> so you can protect httpd more strongly than xlock.

Never! :-)

> > > What about programs with spaces in its names?
> > I thought "screw 'em". :-)
> Now that's what I call policy!

You gotta let the processes know who's boss.

---------
typedef struct me_s {
char name[] = { "Thomas Habets" };
char email[] = { "[email protected]" };
char kernel[] = { "Linux" };
char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt" };
char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE 0945 286A E90A AD48 E854" };
char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


Attachments:
(No filename) (1.66 kB)
(No filename) (189.00 B)
Download all attachments

2004-09-23 23:58:49

by Andries Brouwer

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Thu, Sep 23, 2004 at 01:23:08AM +0200, Thomas Habets wrote:

> How about a sysctl that does "for the love of kbaek, don't ever kill these
> processes when OOM. If nothing else can be killed, I'd rather you panic"?

An aircraft company discovered that it was cheaper to fly its planes
with less fuel on board. The planes would be lighter and use less fuel
and money was saved. On rare occasions however the amount of fuel was
insufficient, and the plane would crash. This problem was solved by
the engineers of the company by the development of a special OOF
(out-of-fuel) mechanism. In emergency cases a passenger was selected
and thrown out of the plane. (When necessary, the procedure was
repeated.) A large body of theory was developed and many publications
were devoted to the problem of properly selecting the victim to be
ejected. Should the victim be chosen at random? Or should one choose
the heaviest person? Or the oldest? Should passengers pay in order not
to be ejected, so that the victim would be the poorest on board? And
if for example the heaviest person was chosen, should there be a
special exception in case that was the pilot? Should first class
passengers be exempted? Now that the OOF mechanism existed, it would
be activated every now and then, and eject passengers even when there
was no fuel shortage. The engineers are still studying precisely how
this malfunction is caused.

2004-09-24 14:18:36

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Hi!

> > How about a sysctl that does "for the love of kbaek, don't ever kill these
> > processes when OOM. If nothing else can be killed, I'd rather you panic"?

You should not be able to force kernel panic. But it might make sense to say
"if you want to kill xlock, kill all processes with same uid, too".

> An aircraft company discovered that it was cheaper to fly its planes
> with less fuel on board. The planes would be lighter and use less fuel
> and money was saved. On rare occasions however the amount of fuel was
> insufficient, and the plane would crash. This problem was solved by
> the engineers of the company by the development of a special OOF
> (out-of-fuel) mechanism. In emergency cases a passenger was selected
> and thrown out of the plane. (When necessary, the procedure was
> repeated.) A large body of theory was developed and many publications
> were devoted to the problem of properly selecting the victim to be
> ejected. Should the victim be chosen at random? Or should one choose
> the heaviest person? Or the oldest? Should passengers pay in order not
> to be ejected, so that the victim would be the poorest on board? And
> if for example the heaviest person was chosen, should there be a
> special exception in case that was the pilot? Should first class
> passengers be exempted? Now that the OOF mechanism existed, it would
> be activated every now and then, and eject passengers even when there
> was no fuel shortage. The engineers are still studying precisely how
> this malfunction is caused.

:-))))

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-09-24 14:22:42

by Alan

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Gwe, 2004-09-24 at 00:45, Andries Brouwer wrote:
> On Thu, Sep 23, 2004 at 01:23:08AM +0200, Thomas Habets wrote:
>
> > How about a sysctl that does "for the love of kbaek, don't ever kill these
> > processes when OOM. If nothing else can be killed, I'd rather you panic"?
>
> An aircraft company discovered that it was cheaper to fly its planes
> with less fuel on board. The planes would be lighter and use less fuel

Chuckle.

The rest of us just turn on "no overcommit" in /proc/sys. That way we
get told before takeoff that there isnt enough memory instead. Really
I'm astound that the default is still to pray you don't get thrown off
the plane.

Alan

2004-09-24 20:07:23

by Thomas Habets

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Once upon a midnight dreary, Alan Cox pondered, weak and weary:
> The rest of us just turn on "no overcommit" in /proc/sys.

$ cat /proc/sys/vm/overcommit_{memory,ratio}
0
50

Well that didn't help.
Me thinks the text in Documentation/filesystems/proc.txt may be out of date,
especially considering it doesn't say the same as
Documentation/vm/overcommit-accounting.txt.

overcommit_memory
-----------------

This file contains one value. The following algorithm is used to decide if
there's enough memory: if the value of overcommit_memory is positive, then
there's always enough memory. This is a useful feature, since programs often
malloc() huge amounts of memory 'just in case', while they only use a small
part of it. Leaving this value at 0 will lead to the failure of such a huge
malloc(), when in fact the system has enough memory for the program to run.

On the other hand, enabling this feature can cause you to run out of memory
and thrash the system to death, so large and/or important servers will want to
set this value to 0.


And also, I'd like to see how a misbehaving airline passenger could start to
gain weight not originally on the plane, causing the flight attendants to
start executing people because of OOF. And IIRC most airlines don't like
having women onboard who are way too pregnant, so no forking either.

---------
typedef struct me_s {
char name[] = { "Thomas Habets" };
char email[] = { "[email protected]" };
char kernel[] = { "Linux" };
char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt" };
char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE 0945 286A E90A AD48 E854" };
char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


Attachments:
(No filename) (1.67 kB)
(No filename) (189.00 B)
Download all attachments

2004-09-24 22:18:27

by Alan

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Gwe, 2004-09-24 at 20:58, Thomas Habets wrote:
> And also, I'd like to see how a misbehaving airline passenger could start to
> gain weight not originally on the plane, causing the flight attendants to
> start executing people because of OOF. And IIRC most airlines don't like
> having women onboard who are way too pregnant, so no forking either.

The zero over commit code makes sure that we have enough swap/memory for
fillable address space. It means the application will always be told
when it takes an action that it cannot do it, rather than finding out
later and being killed.

Alan

2004-09-24 22:57:54

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Fri, 24 Sep 2004 01:45:20 +0200, Andries Brouwer <[email protected]> wrote:

> On Thu, Sep 23, 2004 at 01:23:08AM +0200, Thomas Habets wrote:

> > How about a sysctl that does "for the love of kbaek, don't ever kill these
> > processes when OOM. If nothing else can be killed, I'd rather you panic"?

> An aircraft company discovered that it was cheaper to fly its planes
> with less fuel on board. The planes would be lighter and use less fuel
> and money was saved. On rare occasions however the amount of fuel was
> insufficient, and the plane would crash. This problem was solved by
> the engineers of the company by the development of a special OOF
> (out-of-fuel) mechanism.

For the curious I have recently been reading about this (I'm a nervous
flyer, you wouldn't believe the kind of statistics I scare myself
with) and discovered the term RAT - RAM Air Turbine. In the event of
fuel running out, modern aircraft automatically drop this turbine and
generate sufficient power for navigation (and hopefully safe landing).
There's a famous early 1980s case in Canada known as the Gimli Glider
in which this actually ended up happing after a computer that
performed imperial/metric conversion failed and the manual calculation
was wrong - they coined a popular Canadian phrase because of this.

What we need is a mechanism to have a giant brainstraw emerge from the
front casing of the machine and suck the brains out of the guy running
a server with overcommit issues.

[ Alan has a working model super drinking straw from OLS - pity you
destroyed it. ]

Jon.

2004-09-25 10:10:34

by Thomas Habets

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Once upon a midnight dreary, Alan Cox pondered, weak and weary:
> > And also, I'd like to see how a misbehaving airline passenger could start
> > to gain weight not originally on the plane, causing the flight attendants
> > to start executing people because of OOF. And IIRC most airlines don't
> > like having women onboard who are way too pregnant, so no forking either.
> The zero over commit code makes sure that we have enough swap/memory

I was talking about the simile.

---------
typedef struct me_s {
char name[] = { "Thomas Habets" };
char email[] = { "[email protected]" };
char kernel[] = { "Linux" };
char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt" };
char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE 0945 286A E90A AD48 E854" };
char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


Attachments:
(No filename) (833.00 B)
(No filename) (189.00 B)
Download all attachments

2004-09-25 16:32:50

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Thu, Sep 23, 2004 at 01:23:08AM +0200, Thomas Habets wrote:
>
> Hello.
>
> How about a sysctl that does "for the love of kbaek, don't ever kill these
> processes when OOM. If nothing else can be killed, I'd rather you panic"?
>
> Examples for this list would be /usr/bin/vlock and /usr/X11R6/bin/xlock.
> I just got a very uncomfortable surprise when found my box unlocked thanks to
> this.
>
> After playing around a bit, I made the patch below, but it's almost completely
> untested. I'm not even sure I take the binaries name from the right place.
> And I don't know if the locking can race. If it's too ugly then it'd be great
> if someone implemented it the right way. (iow: huge fucking disclaimer)
>
> echo "/usr/bin/vlock /usr/X11R6/bin/xlock" > /proc/sys/vm/oom_pardon

we had to implement this for our kernel too. The logic is similar to the
one in your patch. The oom killer is not trustable on servers where the
biggest task is the only one we trust, this patch tries to workaround
it. This is just for your info and to standardize on the same API if it
overlaps functionality.

this patch is from Kurt and it's called protect-pids. Even after
rewriting the oom killer, this would probably remain a nice feature for
a few selected trusted apps.

The value is inherited by fork, one slight modification I would probably
like is to reset the value during exec so it can be safely used on ssh
too (so that bash and its childs aren't unkillable too, currently
enabling this on ssh is like allowing all normal users logging in to DoS
the machine, but it's more generic for bigger apps like databases that
may fork+exec). I'd probably also place oomkilladj at the end of the
task struct instead of counting the bits but it doesn't make any
difference in practice.

Comments?

--------------------------------------------------------------------------
Feature.

This patch changes the way Linux choses processes when an out-of-memory (OOM)
situation is detected.
* It sends a SIGTERM first to a process. If the process is selected a
second time by the OOM killer because we're still OOM, it will get
a SIGKILL.
* It does export the OOM score of a process to userspace in
/proc/$PID/oom_score
* It does allow the admin (CAP_SYS_RESOURCE) to change the OOM score by a
power of two via
/proc/$PID/oom_adj
The number needs to be in [-16..15], where -16 means that the score is
lowered by a factor of 65536(2^16), thus making the process very unlikely
to be chosen. Normally, numbers in [-8..7] should be used ...

fs/proc/base.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 4 ++
mm/oom_kill.c | 25 ++++++++++++++--
3 files changed, 98 insertions(+), 5 deletions(-)

diff -uNrp linux-2.6.4.old/fs/proc/base.c linux-2.6.4/fs/proc/base.c
--- linux-2.6.4.old/fs/proc/base.c 2004-04-01 19:57:29.000000000 +0200
+++ linux-2.6.4/fs/proc/base.c 2004-04-02 12:01:52.000000000 +0200
@@ -69,6 +69,8 @@ enum pid_directory_inos {
PROC_TGID_ATTR_FSCREATE,
#endif
PROC_TGID_FD_DIR,
+ PROC_TGID_OOM_SCORE,
+ PROC_TGID_OOM_ADJUST,
PROC_TID_INO,
PROC_TID_STATUS,
PROC_TID_MEM,
@@ -96,6 +98,8 @@ enum pid_directory_inos {
PROC_TGID_DELAY_ACCT,
#endif
PROC_TID_FD_DIR = 0x8000, /* 0x8000-0xffff */
+ PROC_TID_OOM_SCORE,
+ PROC_TID_OOM_ADJUST,
};

struct pid_entry {
@@ -134,6 +138,8 @@ static struct pid_entry tgid_base_stuff[
#ifdef CONFIG_KALLSYMS
E(PROC_TGID_WCHAN, "wchan", S_IFREG|S_IRUGO),
#endif
+ E(PROC_TGID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO),
+ E(PROC_TGID_OOM_ADJUST,"oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
{0,0,NULL,0}
};
static struct pid_entry tid_base_stuff[] = {
@@ -159,6 +165,8 @@ static struct pid_entry tid_base_stuff[]
#ifdef CONFIG_KALLSYMS
E(PROC_TID_WCHAN, "wchan", S_IFREG|S_IRUGO),
#endif
+ E(PROC_TID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO),
+ E(PROC_TID_OOM_ADJUST, "oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
{0,0,NULL,0}
};

@@ -416,6 +424,14 @@ static int proc_pid_wchan(struct task_st
}
#endif /* CONFIG_KALLSYMS */

+/* The badness from the OOM killer */
+int badness(struct task_struct *p);
+static int proc_oom_score(struct task_struct *task, char *buffer)
+{
+ int points = badness(task);
+ return sprintf(buffer, "%i\n", points);
+}
+
/************************************************************************/
/* Here the fs part begins */
/************************************************************************/
@@ -753,6 +769,55 @@ static struct file_operations proc_mapba
};
#endif /* __HAS_ARCH_PROC_MAPPED_BASE */

+static ssize_t oom_adjust_read(struct file * file, char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ char buffer[8];
+ size_t len;
+ int oom_adjust = task->oomkilladj;
+
+ len = sprintf(buffer, "%i\n", oom_adjust) + 1;
+ if (*ppos >= len)
+ return 0;
+ if (count > len-*ppos)
+ count = len-*ppos;
+ if (copy_to_user(buf, buffer + *ppos, count))
+ return -EFAULT;
+ *ppos += count;
+ return count;
+}
+
+static ssize_t oom_adjust_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ char buffer[8], *end;
+ int oom_adjust;
+
+ if (!capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+ memset(buffer, 0, 8);
+ if (count > 6)
+ count = 6;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+ oom_adjust = simple_strtol(buffer, &end, 0);
+ if (oom_adjust < -16 || oom_adjust > 15)
+ return -EINVAL;
+ if (*end == '\n')
+ end++;
+ task->oomkilladj = oom_adjust;
+ if (end - buffer == 0)
+ return -EIO;
+ return end - buffer;
+}
+
+static struct file_operations proc_oom_adjust_operations = {
+ read: oom_adjust_read,
+ write: oom_adjust_write,
+};
+
static struct inode_operations proc_mem_inode_operations = {
.permission = proc_permission,
};
@@ -1467,6 +1532,15 @@ static struct dentry *proc_pident_lookup
ei->op.proc_read = proc_pid_delay;
break;
#endif
+ case PROC_TID_OOM_SCORE:
+ case PROC_TGID_OOM_SCORE:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_oom_score;
+ break;
+ case PROC_TID_OOM_ADJUST:
+ case PROC_TGID_OOM_ADJUST:
+ inode->i_fop = &proc_oom_adjust_operations;
+ break;
default:
printk("procfs: impossible type (%d)",p->type);
iput(inode);
diff -uNrp linux-2.6.4.old/include/linux/sched.h linux-2.6.4/include/linux/sched.h
--- linux-2.6.4.old/include/linux/sched.h 2004-04-01 19:57:29.000000000 +0200
+++ linux-2.6.4/include/linux/sched.h 2004-04-02 11:33:10.305376136 +0200
@@ -442,7 +442,9 @@ struct task_struct {
struct user_struct *user;
/* limits */
struct rlimit rlim[RLIM_NLIMITS];
- unsigned short used_math;
+ unsigned short used_math:1;
+ unsigned short rcvd_sigterm:1; /* Received SIGTERM by oom killer already */
+ short oomkilladj:5; /* OOM kill score adjustment (bit shift) */
char comm[16];
/* file system info */
int link_count, total_link_count;
diff -uNrp linux-2.6.4.old/mm/oom_kill.c linux-2.6.4/mm/oom_kill.c
--- linux-2.6.4.old/mm/oom_kill.c 2004-03-11 03:55:28.000000000 +0100
+++ linux-2.6.4/mm/oom_kill.c 2004-04-02 11:37:34.802253850 +0200
@@ -41,7 +41,7 @@
* of least surprise ... (be careful when you change it)
*/

-static int badness(struct task_struct *p)
+int badness(struct task_struct *p)
{
int points, cpu_time, run_time, s;

@@ -93,6 +93,21 @@ static int badness(struct task_struct *p
*/
if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
points /= 4;
+
+ /*
+ * Adjust the score by oomkilladj.
+ */
+ if (p->oomkilladj) {
+ if (p->oomkilladj > 0)
+ points <<= p->oomkilladj;
+ else
+ points >>= -(p->oomkilladj);
+ }
+ /*
+ * One point for already having received a warning
+ */
+ points += p->rcvd_sigterm;
+
#ifdef DEBUG
printk(KERN_DEBUG "OOMkill: task %d (%s) got %d points\n",
p->pid, p->comm, points);
@@ -152,11 +166,13 @@ static void __oom_kill_task(task_t *p)
p->flags |= PF_MEMALLOC | PF_MEMDIE;

/* This process has hardware access, be more careful. */
- if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO)) {
+ if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
force_sig(SIGTERM, p);
- } else {
+ else if (p->rcvd_sigterm++)
force_sig(SIGKILL, p);
- }
+ else
+ force_sig(SIGTERM, p);
+
}

static struct mm_struct *oom_kill_task(task_t *p)


2004-09-27 12:02:14

by Thomas Habets

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

> What we need is a mechanism to have a giant brainstraw emerge from the
> front casing of the machine and suck the brains out of the guy running
> a server with overcommit issues.

So the way to deal with OOM-killer issues is to laugh at people who encounter
it? How very openbsd of you.

And I don't run X or xlock on any of my servers. IOW: this was not a server.

---------
typedef struct me_s {
char name[] = { "Thomas Habets" };
char email[] = { "[email protected]" };
char kernel[] = { "Linux" };
char *pgpKey[] = { "http://www.habets.pp.se/pubkey.txt" };
char pgp[] = { "A8A3 D1DD 4AE0 8467 7FDE 0945 286A E90A AD48 E854" };
char coolcmd[] = { "echo '. ./_&. ./_'>_;. ./_" };
} me_t;


Attachments:
(No filename) (724.00 B)
(No filename) (189.00 B)
Download all attachments

2004-09-27 12:17:26

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Mon, 27 Sep 2004 14:00:56 +0200, Thomas Habets <[email protected]> wrote:
> > What we need is a mechanism to have a giant brainstraw emerge from the
> > front casing of the machine and suck the brains out of the guy running
> > a server with overcommit issues.
>
> So the way to deal with OOM-killer issues is to laugh at people who encounter
> it? How very openbsd of you.

Actually I was leaning towards non-overcommit as an approach if you're
that worried (I know user non-overcommit is not a perfect solution). I
like Andrea's patch but this is one of these circular issues that'll
keep popping up from time to time because there isn't a perfect
solution that works for everyone all of the time. I typically find oom
results in things I want dying underneath me. Windows does something
with extending its paging file when this happens - has anyone looked
at weird alternatives such as this for desktop Linux?

> And I don't run X or xlock on any of my servers. IOW: this was not a server.

Yes ok. I was picking at the OOM Killer rather than at you - you just
didn't get my sense of humour but that's ok.

Jon.

2004-09-27 12:25:19

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Fri, Sep 24, 2004 at 10:15:51PM +0100, Alan Cox wrote:
> On Gwe, 2004-09-24 at 20:58, Thomas Habets wrote:
> > And also, I'd like to see how a misbehaving airline passenger could start to
> > gain weight not originally on the plane, causing the flight attendants to
> > start executing people because of OOF. And IIRC most airlines don't like
> > having women onboard who are way too pregnant, so no forking either.
>
> The zero over commit code makes sure that we have enough swap/memory for
> fillable address space. It means the application will always be told
> when it takes an action that it cannot do it, rather than finding out
> later and being killed.

BTW,I think a lot of applications do not gracefully handle -ENOMEM?

I suppose most of them just fail and bailout with -ENOMEM.

No?

2004-09-27 12:54:50

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On 2004-09-27T07:41:20,
Marcelo Tosatti <[email protected]> said:

> BTW,I think a lot of applications do not gracefully handle -ENOMEM?
>
> I suppose most of them just fail and bailout with -ENOMEM.
>
> No?

True enough. Most of them don't even handle malloc() returning NULL.
That be their problem, though.


2004-09-27 13:12:44

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

Hi all,

Just out of interest then...suppose we've got a loopback swap device
and that we can extend this by creating a new file or extending
somehow the existing one.

What would be wrong with having the page reclaim algorithms use one of
the low memory watermarks as a trigger to call in to userspace to
extend the swap available if possible? This is probably what Microsoft
et al do with their "Windows is extending your virtual memory, yada
yada blah blah". Comments? Already done?

Jon.

2004-09-27 15:20:47

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Mon, Sep 27, 2004 at 02:12:26PM +0100, Jon Masters wrote:
> Hi all,
>
> Just out of interest then...suppose we've got a loopback swap device
> and that we can extend this by creating a new file or extending
> somehow the existing one.
>
> What would be wrong with having the page reclaim algorithms use one of
> the low memory watermarks as a trigger to call in to userspace to
> extend the swap available if possible? This is probably what Microsoft
> et al do with their "Windows is extending your virtual memory, yada
> yada blah blah". Comments? Already done?

You dont to change kernel code for that - make a script to monitor
swap usage, as soon as it gets below a given watermark, you swapon
whatever swapfile you want.

Makes sense yes.

2004-09-27 15:59:34

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Mon, 27 Sep 2004 10:35:54 -0300, Marcelo Tosatti
<[email protected]> wrote:

> On Mon, Sep 27, 2004 at 02:12:26PM +0100, Jon Masters wrote:

> > What would be wrong with having the page reclaim algorithms use one of
> > the low memory watermarks as a trigger to call in to userspace to
> > extend the swap available if possible? This is probably what Microsoft
> > et al do with their "Windows is extending your virtual memory

> You dont to change kernel code for that - make a script to monitor
> swap usage, as soon as it gets below a given watermark, you swapon
> whatever swapfile you want.

Assuming the machine's not swapping itself to death, but I suppose
you're right anyway. I'll have a look at writing something for proof
of concept.

Jon.

2004-09-27 17:13:14

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Mon, Sep 27, 2004 at 10:35:54AM -0300, Marcelo Tosatti wrote:
> On Mon, Sep 27, 2004 at 02:12:26PM +0100, Jon Masters wrote:
> > Hi all,
> >
> > Just out of interest then...suppose we've got a loopback swap device
> > and that we can extend this by creating a new file or extending
> > somehow the existing one.
> >
> > What would be wrong with having the page reclaim algorithms use one of
> > the low memory watermarks as a trigger to call in to userspace to
> > extend the swap available if possible? This is probably what Microsoft
> > et al do with their "Windows is extending your virtual memory, yada
> > yada blah blah". Comments? Already done?
>
> You dont to change kernel code for that - make a script to monitor
> swap usage, as soon as it gets below a given watermark, you swapon
> whatever swapfile you want.

hmm, sounds good, but what if next 'burst' of
swapped out data is larger than the watermark?

I'm no friend of the 'extend swap idea' so don't
get me wrong, but userspace can just reduce the
cases where you get out-of-swap, without support
from the kernel side (via some userspace helper)

best,
Herbert

> Makes sense yes.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2004-09-27 18:29:59

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Mon, Sep 27, 2004 at 07:12:53PM +0200, Herbert Poetzl wrote:
> On Mon, Sep 27, 2004 at 10:35:54AM -0300, Marcelo Tosatti wrote:
> > On Mon, Sep 27, 2004 at 02:12:26PM +0100, Jon Masters wrote:
> > > Hi all,
> > >
> > > Just out of interest then...suppose we've got a loopback swap device
> > > and that we can extend this by creating a new file or extending
> > > somehow the existing one.
> > >
> > > What would be wrong with having the page reclaim algorithms use one of
> > > the low memory watermarks as a trigger to call in to userspace to
> > > extend the swap available if possible? This is probably what Microsoft
> > > et al do with their "Windows is extending your virtual memory, yada
> > > yada blah blah". Comments? Already done?
> >
> > You dont to change kernel code for that - make a script to monitor
> > swap usage, as soon as it gets below a given watermark, you swapon
> > whatever swapfile you want.
>
> hmm, sounds good, but what if next 'burst' of
> swapped out data is larger than the watermark?

Give the watermark a large enough value.

> I'm no friend of the 'extend swap idea' so don't
> get me wrong, but userspace can just reduce the
> cases where you get out-of-swap, without support
> from the kernel side (via some userspace helper).

2004-09-27 23:08:18

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Mon, 27 Sep 2004 19:12:53 +0200, Herbert Poetzl <[email protected]> wrote:

> I'm no friend of the 'extend swap idea' so don't
> get me wrong, but userspace can just reduce the
> cases where you get out-of-swap, without support
> from the kernel side (via some userspace helper)

I was just thinking it might be a suitable approach for some of the
distros to take when running on a machine with plenty of disk that for
whatever reason runs at risk of rolling over and dying - better to
take up some additional disk space than to have some critical server
process killed. It's not pretty but then neither is the oom killer,
and this might reduce some of that pain.

Jon.

2004-09-28 13:34:16

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Mon, Sep 27, 2004 at 01:42:19PM -0300, Marcelo Tosatti wrote:
> On Mon, Sep 27, 2004 at 07:12:53PM +0200, Herbert Poetzl wrote:
> > On Mon, Sep 27, 2004 at 10:35:54AM -0300, Marcelo Tosatti wrote:
> > > On Mon, Sep 27, 2004 at 02:12:26PM +0100, Jon Masters wrote:
> > > > Hi all,
> > > >
> > > > Just out of interest then...suppose we've got a loopback swap device
> > > > and that we can extend this by creating a new file or extending
> > > > somehow the existing one.
> > > >
> > > > What would be wrong with having the page reclaim algorithms use one of
> > > > the low memory watermarks as a trigger to call in to userspace to
> > > > extend the swap available if possible? This is probably what Microsoft
> > > > et al do with their "Windows is extending your virtual memory, yada
> > > > yada blah blah". Comments? Already done?
> > >
> > > You dont to change kernel code for that - make a script to monitor
> > > swap usage, as soon as it gets below a given watermark, you swapon
> > > whatever swapfile you want.
> >
> > hmm, sounds good, but what if next 'burst' of
> > swapped out data is larger than the watermark?
>
> Give the watermark a large enough value.

right, probably setting it to the currently
available swapspace solves that issue ;)

anyway as I said, I'm fine with 'does not
work' but not very happy with half-assed
userspace solutions ...

best,
Herbert

> > I'm no friend of the 'extend swap idea' so don't
> > get me wrong, but userspace can just reduce the
> > cases where you get out-of-swap, without support
> > from the kernel side (via some userspace helper).
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2004-09-28 14:23:14

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Tue, Sep 28, 2004 at 03:33:52PM +0200, Herbert Poetzl wrote:
> On Mon, Sep 27, 2004 at 01:42:19PM -0300, Marcelo Tosatti wrote:
> > On Mon, Sep 27, 2004 at 07:12:53PM +0200, Herbert Poetzl wrote:
> > > On Mon, Sep 27, 2004 at 10:35:54AM -0300, Marcelo Tosatti wrote:
> > > > On Mon, Sep 27, 2004 at 02:12:26PM +0100, Jon Masters wrote:
> > > > > Hi all,
> > > > >
> > > > > Just out of interest then...suppose we've got a loopback swap device
> > > > > and that we can extend this by creating a new file or extending
> > > > > somehow the existing one.
> > > > >
> > > > > What would be wrong with having the page reclaim algorithms use one of
> > > > > the low memory watermarks as a trigger to call in to userspace to
> > > > > extend the swap available if possible? This is probably what Microsoft
> > > > > et al do with their "Windows is extending your virtual memory, yada
> > > > > yada blah blah". Comments? Already done?
> > > >
> > > > You dont to change kernel code for that - make a script to monitor
> > > > swap usage, as soon as it gets below a given watermark, you swapon
> > > > whatever swapfile you want.
> > >
> > > hmm, sounds good, but what if next 'burst' of
> > > swapped out data is larger than the watermark?
> >
> > Give the watermark a large enough value.
>
> right, probably setting it to the currently
> available swapspace solves that issue ;)
>
> anyway as I said, I'm fine with 'does not
> work' but not very happy with half-assed
> userspace solutions ...

Herbert,

Honestly, I dont see much difference makes if the swapon procedure
is called from within the kernel instead from userspace.

Have you actually tried such a "on demand swapon" entirely
from userspace to call it "half-assed" solution ? :)

The act of killing tasks is controversial and always generates
debates here. I bet we will continue seeing them over
the years.

If one dont want the OOM killing to happen, he should correctly setup the
swap size for his workload (or have a "on demand swapon" solution which
can be implemented in userspace), or unset overcommit mode.

There's not much to argue about that.

One controversial issue is the OOM killer policy, which is
hardcoded into the kernel - through the last years there have
been several attempts to make it selectable (which this "oom_pardon"
patch is about).

None of these attempts have made into the mainline kernel, because there
hasn't been an agreement on what is the best implementation
of such feature - each implementation is specific to one user
group need (for example "dont kill tasks named bla" or "dont kill
tasks from UID bla" or, or).

But other than the OOM killer policy selection or tuning there's not much
to be argued really.

2004-09-28 23:55:46

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Tue, Sep 28, 2004 at 09:32:56AM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 28, 2004 at 03:33:52PM +0200, Herbert Poetzl wrote:
> > On Mon, Sep 27, 2004 at 01:42:19PM -0300, Marcelo Tosatti wrote:
> > > On Mon, Sep 27, 2004 at 07:12:53PM +0200, Herbert Poetzl wrote:
> > > > On Mon, Sep 27, 2004 at 10:35:54AM -0300, Marcelo Tosatti wrote:
> > > > > On Mon, Sep 27, 2004 at 02:12:26PM +0100, Jon Masters wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > Just out of interest then...suppose we've got a loopback swap device
> > > > > > and that we can extend this by creating a new file or extending
> > > > > > somehow the existing one.
> > > > > >
> > > > > > What would be wrong with having the page reclaim algorithms use one of
> > > > > > the low memory watermarks as a trigger to call in to userspace to
> > > > > > extend the swap available if possible? This is probably what Microsoft
> > > > > > et al do with their "Windows is extending your virtual memory, yada
> > > > > > yada blah blah". Comments? Already done?
> > > > >
> > > > > You dont to change kernel code for that - make a script to monitor
> > > > > swap usage, as soon as it gets below a given watermark, you swapon
> > > > > whatever swapfile you want.
> > > >
> > > > hmm, sounds good, but what if next 'burst' of
> > > > swapped out data is larger than the watermark?
> > >
> > > Give the watermark a large enough value.
> >
> > right, probably setting it to the currently
> > available swapspace solves that issue ;)
> >
> > anyway as I said, I'm fine with 'does not
> > work' but not very happy with half-assed
> > userspace solutions ...
>
> Herbert,
>
> Honestly, I dont see much difference makes if the swapon procedure
> is called from within the kernel instead from userspace.

there is a small but (for me) important difference ...

userspace checks for
available swap space
(all fine, still N left)
kernel decides to
swap out N+1 right now
because it's urgent ..
userspace has _no_ chance
to react, and something
gets killed ...

.vs.

kernel decides to
swap out N+1 right now

after N-M swaps, kernel
realizes that swap space
is low and calls
userspace, which adds
more swap space so that ...
the kernel can continue
to swap out stuff ...


> Have you actually tried such a "on demand swapon" entirely
> from userspace to call it "half-assed" solution ? :)

no I haven't tried it, and I'm perfectly fine with
OOM kills and/or swappiness behaviour, seriously!

but to suggest an userspace solution which IMHO can
not be successful without kernel side support seems
to me like a "half-assed" solution ... maybe I'm
wrong and it is quite fine to assume that:

a) userspace will detect that swap is becoming
low just at the right moment
b) userspace will have the additional swap available
just right on time for the kernel to use it
c) all delays will be just so that everything works
out fine ... as with the OOM selection ;)

> The act of killing tasks is controversial and always generates
> debates here. I bet we will continue seeing them over
> the years.
>
> If one dont want the OOM killing to happen, he should correctly
> setup the swap size for his workload

totally agreed!

> (or have a "on demand swapon" solution which can be implemented
> in userspace),

objection! see above ...

> or unset overcommit mode.

agreed!

> There's not much to argue about that.
>
> One controversial issue is the OOM killer policy, which is
> hardcoded into the kernel - through the last years there have
> been several attempts to make it selectable (which this
> "oom_pardon" patch is about).

IMHO this is papering over misunderstood behaviour
by trying to _not_ ejecting 'important' people ;)

> None of these attempts have made into the mainline kernel,
> because there hasn't been an agreement on what is the best
> implementation of such feature - each implementation is
> specific to one user group need (for example "dont kill
> tasks named bla" or "dont kill tasks from UID bla" or, or).

which IMHO is completely wrong, as the real solution
would be to make strict no overcommit the default and
let people who have broken^Wfunny apps enable it to
make them run ...

> But other than the OOM killer policy selection or tuning
> there's not much to be argued really.

except that it should not be done, but as I said, that's
just my not so important opinion ...

putting that aside, thanks for your great work on linux
kernel and the time you spend on this stuff, I really
appreciate it!

best,
Herbert

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2004-09-29 00:52:26

by Andries Brouwer

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Mon, Sep 27, 2004 at 07:41:20AM -0300, Marcelo Tosatti wrote:

> BTW,I think a lot of applications do not gracefully handle -ENOMEM?
>
> I suppose most of them just fail and bailout with -ENOMEM.
>
> No?

Ha Marcelo - Maybe you didnt want to suggest it, but the
argument "It doesnt matter if the default kernel is crap,
because most of userspace is crap as well" doesnt fly.
Some of userspace is well-written and does meaningful things
upon ENOMEM.

Most common: print error message and exit.
Sometimes: clean up, remove lock files, exit.
Sometimes: adapt the algorithm chosen to the amount of memory
that is available, and do the computation. Possibly use
more frequent garbage collections.

Concerning OOM - I tried Solaris, and it behaved well
on my testcases.

---

I think our first goal here must be to have a Linux kernel
that is reliable in the sense that an application will never
get a segfault or be OOM-killed upon access of memory that
malloc() returned.

That is a weaker goal than our current mode 2, but of course
much stronger than modes 1 and 0.

---

I run 2.6.8.1 with /proc/sys/vm/overcommit_memory 2
and /proc/sys/vm/overcommit_ratio 50, and that works
satisfactorily for me (with 256MB physical memory
and 512MB swap). However, without swap the 50 is too low
- I can do work with /proc/sys/vm/overcommit_ratio 80.

I wonder: are there people for whom overcommit_memory
mode 2 is bad even when there is plenty of swap?

If not then we might experiment setting mode 2
by default when lots of swap is available.
If that works we can guarantee that the kernel
is oom-safe at least with a minimum amount of swap.

---

When there is no, or only little, swap, then for me
mode 2 is too strict. I conjecture that a mode
between 0 and 2 is possible, let us call it 2a,
with the property that it works just as well as mode 0,
and is malloc-safe just like mode 2. The idea would
be that memory the user asked for with malloc() is more
likely needed than the memory that has to be reserved
in case COW shared memory is written to and must be duplicated.
This kind of memory could perhaps get a lower weight.

[Have people already tried this?]

Andries


PS - I hope for malloc safety: when malloc returns memory
it is really there. I see other people discuss automatic
extension of swap areas. That can never produce malloc
safety. The program that is eating memory just grows
a bit more and is oom-killed after all.
No, malloc() must return NULL.

2004-09-29 00:54:56

by Alan

[permalink] [raw]
Subject: Re: [PATCH] oom_pardon, aka don't kill my xlock

On Llu, 2004-09-27 at 14:12, Jon Masters wrote:
> What would be wrong with having the page reclaim algorithms use one of
> the low memory watermarks as a trigger to call in to userspace to
> extend the swap available if possible? This is probably what Microsoft
> et al do with their "Windows is extending your virtual memory, yada
> yada blah blah". Comments? Already done?

Seems a reasonable use for a hotplug or netlink event for 2.6, send
patches...