2006-12-07 18:31:20

by Chris Friesen

Subject: additional oom-killer tuneable worth submitting?


The kernel currently has a way to adjust the oom-killer score via
/proc/<pid>/oomadj.

However, to adjust this effectively requires knowledge of the scores of
all the other processes on the system.

I'd like to float an idea (which we've implemented and been using for
some time) where the semantics are slightly different:

We add a new "oom_thresh" member to the task struct.
We introduce a new proc entry "/proc/<pid>/oomthresh" to control it.

The "oom-thresh" value maps to the max expected memory consumption for
that process. As long as a process uses less memory than the specified
threshold, then it is immune to the oom-killer.

On an embedded platform this allows the designer to engineer the system
and protect critical apps based on their expected memory consumption.
If one of those apps goes crazy and starts chewing additional memory
then it becomes vulnerable to the oom killer while the other apps remain
protected.
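
A minimal sketch of the rule described above, not the actual patch. The struct and function names are hypothetical stand-ins, and the sketch assumes that the threshold uses the same units as the memory measure and that 0 means "no threshold set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields involved; not the kernel's task_struct. */
struct task_sketch {
    unsigned long mem_used;   /* current memory consumption (units assumed) */
    unsigned long oom_thresh; /* max expected consumption; 0 = not set (assumed) */
};

/* A task is immune only while it stays strictly below its threshold. */
bool oom_immune(const struct task_sketch *t)
{
    assert(t != NULL);
    return t->oom_thresh != 0 && t->mem_used < t->oom_thresh;
}
```

With a threshold of 2000, a daemon using 1500 would be skipped; once it grows to 2500 it becomes an ordinary candidate again, while other daemons still under their own thresholds stay protected.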

If a patch for the above feature was submitted, would there be any
chance of getting it included? Maybe controlled by a config option?

Chris


2006-12-07 21:37:56

by Jesper Juhl

Subject: Re: additional oom-killer tuneable worth submitting?

On 07/12/06, Chris Friesen wrote:
> Jesper Juhl wrote:
>
> > What happens in the case where the OOM killer really, really needs to
> > kill one or more processes since there is not a single drop of memory
> > available, but all processes are below their configured thresholds?
>
> Then the system wasn't properly engineered. <grin>
>
I had a feeling you'd say that.

> In this case you reboot.
>
I realize that if this case happens the system is misconfigured as far
as oomthresh goes, but if this is a knob that we put in the mainline
kernel then I believe there should be some sort of emergency handling
code that takes this situation into account. Perhaps throw some very
nasty looking log messages and then fall back to the classic OOM
killer behaviour..?
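
One way to picture the fallback suggested here is a two-pass selection: the first pass honours the thresholds, and a second pass ignores them only if the first found nobody. This is a sketch under assumptions; the names are hypothetical, and it scores tasks by total_vm alone, which is a simplification of what the real OOM killer does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task data; not the kernel's task_struct. */
struct candidate {
    unsigned long total_vm;
    unsigned long oom_thresh;   /* 0 = no threshold set (assumed) */
};

/*
 * Returns the index of the chosen victim, or -1 if there are no tasks.
 * *fell_back is set to 1 when every task was under its threshold and the
 * thresholds had to be ignored, so the caller can log a loud warning.
 */
int pick_victim(const struct candidate *c, size_t n, int *fell_back)
{
    assert(fell_back != NULL);
    *fell_back = 0;

    for (int pass = 0; pass < 2; pass++) {
        int best = -1;

        for (size_t i = 0; i < n; i++) {
            int shielded = c[i].oom_thresh != 0 &&
                           c[i].total_vm < c[i].oom_thresh;

            if (pass == 0 && shielded)
                continue;               /* first pass honours thresholds */
            if (best < 0 || c[i].total_vm > c[(size_t)best].total_vm)
                best = (int)i;
        }
        if (best >= 0) {
            *fell_back = pass;          /* pass 1 means the emergency path ran */
            return best;
        }
    }
    return -1;
}
```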


--
Jesper Juhl
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-12-07 21:27:00

by Chris Friesen

Subject: Re: additional oom-killer tuneable worth submitting?

Peter Zijlstra wrote:
> On Thu, 2006-12-07 at 12:30 -0600, Chris Friesen wrote:

>>The "oom-thresh" value maps to the max expected memory consumption for
>>that process. As long as a process uses less memory than the specified
>>threshold, then it is immune to the oom-killer.
>
> You would need to specify the measure of memory used by your process;
> see the (still not resolved) RSS debate.

Currently we simply use mm->total_vm, same as the oom killer.

Chris

2006-12-07 21:25:20

by Chris Friesen

Subject: Re: additional oom-killer tuneable worth submitting?

Jesper Juhl wrote:

> How do "oomthresh" and "oomadj" affect each other?

If memory consumption is less than "oomthresh", that process is simply
bypassed. (Equivalent to oomkilladj==OOM_DISABLE.) Otherwise, continue
processing as normal.
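
A rough sketch of that interaction, loosely modeled on how a badness-style score could treat the two knobs. Everything here is an assumption for illustration: the struct, the function name, the -17 value used for OOM_DISABLE, and the shift-based adjustment are simplified and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define OOM_DISABLE_SKETCH (-17)   /* assumed to mirror the kernel's OOM_DISABLE */

/* Hypothetical stand-in for the fields involved; not the kernel's task_struct. */
struct task_sketch {
    unsigned long total_vm;
    unsigned long oom_thresh;   /* 0 = no threshold set */
    int oomkilladj;             /* assumed range: -17..15 */
};

/* Returns 0 for tasks that must be bypassed, otherwise a score. */
unsigned long badness_sketch(const struct task_sketch *p)
{
    unsigned long points;

    assert(p != NULL);
    assert(p->oomkilladj >= OOM_DISABLE_SKETCH && p->oomkilladj <= 15);

    if (p->oomkilladj == OOM_DISABLE_SKETCH)
        return 0;                       /* explicitly disabled */
    if (p->total_vm < p->oom_thresh)
        return 0;                       /* below threshold: same effect */

    /* Otherwise processing continues as normal, including the adjustment. */
    points = p->total_vm;
    if (p->oomkilladj > 0)
        points <<= p->oomkilladj;
    else if (p->oomkilladj < 0)
        points >>= -p->oomkilladj;
    return points;
}
```

Because the threshold check comes before the adjustment, a task under its threshold is skipped regardless of its adjustment value; the adjustment only matters once the task has outgrown its threshold.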

> Default "oomthresh" value for a new process is 0 (zero) I assume -
> right? If not, then I'd suggest that it should be.

Correct.

> What happens when a process fork()s? Does the child inherit the
> parent's "oomthresh" value?

Currently it does not. This is to allow for different memory access
patterns by parent/child. And exec() wipes it as well.
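
The lifecycle described here can be sketched as follows. The struct and both function names are hypothetical; they only model the bookkeeping (child starts unprotected, exec clears the threshold) and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields involved; not the kernel's task_struct. */
struct task_sketch {
    unsigned long total_vm;
    unsigned long oom_thresh;   /* 0 = no threshold set */
};

/* Models fork(): the child copies the parent but does not inherit the threshold. */
struct task_sketch fork_sketch(const struct task_sketch *parent)
{
    struct task_sketch child;

    assert(parent != NULL);
    child = *parent;
    child.oom_thresh = 0;
    return child;
}

/* Models exec(): the new image gets a fresh footprint and no threshold. */
void exec_sketch(struct task_sketch *t, unsigned long new_total_vm)
{
    assert(t != NULL);
    t->total_vm = new_total_vm;
    t->oom_thresh = 0;
}
```

Under these semantics a daemon would have to set its own threshold after it starts, since nothing survives from whatever launched it.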

> Would it make sense to make "oomthresh" apply to process groups
> instead of processes?

Hmm...it might make sense given that the point of the group is to manage
tasks together...but it would make accounting more tricky. Currently
it's just a very simple comparison of p->mm->total_vm against the
threshold in badness().

> What happens in the case where the OOM killer really, really needs to
> kill one or more processes since there is not a single drop of memory
> available, but all processes are below their configured thresholds?

Then the system wasn't properly engineered. <grin>

In this case you reboot.

Chris

2006-12-07 18:50:51

by Jesper Juhl

Subject: Re: additional oom-killer tuneable worth submitting?

A few questions below.

On 07/12/06, Chris Friesen wrote:
>
> The kernel currently has a way to adjust the oom-killer score via
> /proc/<pid>/oomadj.
>
> However, to adjust this effectively requires knowledge of the scores of
> all the other processes on the system.
>
> I'd like to float an idea (which we've implemented and been using for
> some time) where the semantics are slightly different:
>
> We add a new "oom_thresh" member to the task struct.
> We introduce a new proc entry "/proc/<pid>/oomthresh" to control it.
>

How do "oomthresh" and "oomadj" affect each other?


> The "oom-thresh" value maps to the max expected memory consumption for
> that process. As long as a process uses less memory than the specified
> threshold, then it is immune to the oom-killer.
>

Default "oomthresh" value for a new process is 0 (zero) I assume -
right? If not, then I'd suggest that it should be.

What happens when a process fork()s? Does the child inherit the
parent's "oomthresh" value?

Would it make sense to make "oomthresh" apply to process groups
instead of processes?


> On an embedded platform this allows the designer to engineer the system
> and protect critical apps based on their expected memory consumption.
> If one of those apps goes crazy and starts chewing additional memory
> then it becomes vulnerable to the oom killer while the other apps remain
> protected.
>

What happens in the case where the OOM killer really, really needs to
kill one or more processes since there is not a single drop of memory
available, but all processes are below their configured thresholds?


> If a patch for the above feature was submitted, would there be any
> chance of getting it included? Maybe controlled by a config option?

Impossible to know without posting the patch for review :)


--
Jesper Juhl
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-12-07 23:21:31

by Chris Friesen

Subject: Re: additional oom-killer tuneable worth submitting?

Alan wrote:

>>The "oom-thresh" value maps to the max expected memory consumption for
>>that process. As long as a process uses less memory than the specified
>>threshold, then it is immune to the oom-killer.

> You've just introduced a deadlock. What happens if nobody is over that
> predicted memory and the kernel uses more resources?

Based on the discussion with Jesper, we fall back to regular behaviour.
(Or possibly hang or reboot, if we added another switch).

>>On an embedded platform this allows the designer to engineer the system
>>and protect critical apps based on their expected memory consumption.
>>If one of those apps goes crazy and starts chewing additional memory
>>then it becomes vulnerable to the oom killer while the other apps remain
>>protected.

> That is why we have no-overcommit support. Now there is an argument for
> a meaningful rlimit-as to go with it, and together I think they do what
> you really need.

No overcommit only protects the system as a whole, not any particular
processes. The purpose of this is to protect specific daemons from
being killed when the system as a whole is short on memory. Same
rationale as for oomadj, but different knob to twiddle.
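
For readers unfamiliar with the alternative being discussed: an address-space rlimit caps what a single process may map, so an oversized request fails in that process instead of triggering the OOM killer later. The sketch below shows that effect with the standard setrlimit(RLIMIT_AS) call in a forked child. The helper name and the return convention are made up for illustration, and it demonstrates only the rlimit half, not strict overcommit (vm.overcommit_memory=2).

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child, caps its address space at limit_bytes, and tries to
 * allocate alloc_bytes there. Returns 0 if the allocation succeeded,
 * 1 if it failed, and -1 on any setup error.
 */
int alloc_under_as_limit(size_t limit_bytes, size_t alloc_bytes)
{
    int status;
    pid_t pid;

    assert(limit_bytes > 0);
    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit rl;
        void *volatile p;   /* volatile so the allocation is not optimized away */

        if (getrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);
        rl.rlim_cur = limit_bytes;          /* lower only the soft limit */
        if (setrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);
        p = malloc(alloc_bytes);
        _exit(p == NULL ? 1 : 0);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

Note the difference in who pays: the limit makes the runaway process's own allocation fail, whereas the threshold proposal decides which process is sacrificed once the whole system is already out of memory.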

Chris

2006-12-07 22:25:10

by Jesper Juhl

Subject: Re: additional oom-killer tuneable worth submitting?

On 07/12/06, Chris Friesen wrote:
> Jesper Juhl wrote:
> >> Jesper Juhl wrote:
>
> >> > What happens in the case where the OOM killer really, really needs to
> >> > kill one or more processes since there is not a single drop of memory
> >> > available, but all processes are below their configured thresholds?
>
> > I realize that if this case happens the system is misconfigured as far
> > as oomthresh goes, but if this is a knob that we put in the mainline
> > kernel then I believe there should be some sort of emergency handling
> > code that takes this situation into account. Perhaps throw some very
> > nasty looking log messages and then fall back to the classic OOM
> > killer behaviour..?
>
> Yeah, I can see that the reboot might be a bit drastic for mainline. I
> think the fallback to classic behaviour might work okay.
>
> Anyway, the chances of hitting that case are likely pretty slim. The
> way we've been using this is to only set the threshold for fairly
> important long-lived daemons. Much of the "standard" stuff (shell, cat,
> cp, mv, etc.) is left unprotected.
>
Sure, that's sensible, to only protect the important stuff.
But even if the chances of hitting this are slim, we still need a way
out. For most people anything is better than a hung box.

Some examples:

For a desktop (where people may be experimenting with the feature) -
seeing your firefox process evaporate due to the OOM killer and then
finding a message explaining what happened in dmesg is a lot less
frustrating than a hang or sudden reboot.

For a server - If you mis-configure the new feature you may be in for
a long drive to reboot a box whereas falling back to the classic OOM
killer (+ nasty messages in dmesg) will likely save you the trip and
clue you in as to what you mis-configured.

For an embedded box - triggering a reboot would probably be better
than either a hang or a classic OOM kill in many cases (better to have the
device reboot and come back working than to hang or start
malfunctioning due to a missing process).

So maybe what's needed is an additional knob for people to tweak - one
that selects what should happen in this rare case: 1) fall back to
classic OOM (default), 2) reboot, 3) hang. In all cases messages
should be logged explaining what happened.
Or is that overkill? If so I'd personally prefer just falling back to
classic OOM kill in this case.
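
The three-way knob proposed above could look roughly like this. The enum, the function, and the message strings are all hypothetical; the sketch only shows that every setting produces a log message and that an unknown value falls back to the default (classic) behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical knob values, numbered as in the list above. */
enum oom_fallback {
    OOM_FALLBACK_CLASSIC = 1,   /* default: ignore thresholds, kill as usual */
    OOM_FALLBACK_REBOOT  = 2,
    OOM_FALLBACK_HANG    = 3
};

/*
 * Maps the raw knob value to a policy and always sets a message to log.
 * Any unrecognized value is treated as the default.
 */
enum oom_fallback resolve_fallback(int knob, const char **msg)
{
    assert(msg != NULL);

    switch (knob) {
    case OOM_FALLBACK_REBOOT:
        *msg = "oom: all tasks below oomthresh, rebooting";
        return OOM_FALLBACK_REBOOT;
    case OOM_FALLBACK_HANG:
        *msg = "oom: all tasks below oomthresh, hanging";
        return OOM_FALLBACK_HANG;
    default:
        *msg = "oom: all tasks below oomthresh, falling back to classic OOM kill";
        return OOM_FALLBACK_CLASSIC;
    }
}
```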

Give me a way out for the "OOM but all processes below threshold" case,
perhaps coupled with oomthresh applying to process groups instead of
just processes, and I personally start to like this feature.

Let's see some code...

--
Jesper Juhl
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-12-07 21:57:49

by Chris Friesen

Subject: Re: additional oom-killer tuneable worth submitting?

Jesper Juhl wrote:
>> Jesper Juhl wrote:

>> > What happens in the case where the OOM killer really, really needs to
>> > kill one or more processes since there is not a single drop of memory
>> > available, but all processes are below their configured thresholds?

> I realize that if this case happens the system is misconfigured as far
> as oomthresh goes, but if this is a knob that we put in the mainline
> kernel then I believe there should be some sort of emergency handling
> code that takes this situation into account. Perhaps throw some very
> nasty looking log messages and then fall back to the classic OOM
> killer behaviour..?

Yeah, I can see that the reboot might be a bit drastic for mainline. I
think the fallback to classic behaviour might work okay.

Anyway, the chances of hitting that case are likely pretty slim. The
way we've been using this is to only set the threshold for fairly
important long-lived daemons. Much of the "standard" stuff (shell, cat,
cp, mv, etc.) is left unprotected.

Chris

2006-12-07 19:21:50

by Peter Zijlstra

Subject: Re: additional oom-killer tuneable worth submitting?

On Thu, 2006-12-07 at 12:30 -0600, Chris Friesen wrote:
> The kernel currently has a way to adjust the oom-killer score via
> /proc/<pid>/oomadj.
>
> However, to adjust this effectively requires knowledge of the scores of
> all the other processes on the system.
>
> I'd like to float an idea (which we've implemented and been using for
> some time) where the semantics are slightly different:
>
> We add a new "oom_thresh" member to the task struct.
> We introduce a new proc entry "/proc/<pid>/oomthresh" to control it.
>
> The "oom-thresh" value maps to the max expected memory consumption for
> that process. As long as a process uses less memory than the specified
> threshold, then it is immune to the oom-killer.

You would need to specify the measure of memory used by your process;
see the (still not resolved) RSS debate.

> On an embedded platform this allows the designer to engineer the system
> and protect critical apps based on their expected memory consumption.
> If one of those apps goes crazy and starts chewing additional memory
> then it becomes vulnerable to the oom killer while the other apps remain
> protected.
>
> If a patch for the above feature was submitted, would there be any
> chance of getting it included? Maybe controlled by a config option?


2006-12-07 23:14:34

by Alan

Subject: Re: additional oom-killer tuneable worth submitting?

> We add a new "oom_thresh" member to the task struct.
> We introduce a new proc entry "/proc/<pid>/oomthresh" to control it.
>
> The "oom-thresh" value maps to the max expected memory consumption for
> that process. As long as a process uses less memory than the specified
> threshold, then it is immune to the oom-killer.

You've just introduced a deadlock. What happens if nobody is over that
predicted memory and the kernel uses more resources?
>
> On an embedded platform this allows the designer to engineer the system
> and protect critical apps based on their expected memory consumption.
> If one of those apps goes crazy and starts chewing additional memory
> then it becomes vulnerable to the oom killer while the other apps remain
> protected.

That is why we have no-overcommit support. Now there is an argument for
a meaningful rlimit-as to go with it, and together I think they do what
you really need.

Alan