i thought it would be nice to finally hear something good about the OOM
killer.
i am testing Evolution (Ximian's GNOME emailer/groupware app), and the
latest Evolution CVS snapshot went crazy when trying to copy a mail
folder. my load average spiked, swap filled, and then i ran out of
memory.
*poof*, Evolution was killed, and everything returned to normal.
kernel showed:
Out of Memory: Killed process 1296 (evolution-mail).
Out of Memory: Killed process 1296 (evolution-mail).
Out of Memory: Killed process 1296 (evolution-mail).
Out of Memory: Killed process 1302 (evolution-mail).
Out of Memory: Killed process 1303 (evolution-mail).
Out of Memory: Killed process 1306 (evolution-mail).
Out of Memory: Killed process 1307 (evolution-mail).
now, i don't know whether the load average spike was Evolution's fault
or not... but everything seemed to work. Good job.
--
Robert M. Love
rml at ufl.edu
rml at tech9.net
On Saturday, 07 July 2001, at 18:00:08 -0400,
Robert Love wrote:
> i thought it would be nice to finally hear something good about the OOM
> killer.
> [...]
> kernel showed:
> Out of Memory: Killed process 1296 (evolution-mail).
> [...]
> Out of Memory: Killed process 1307 (evolution-mail).
>
<BEWARE: NEWBIE REASONING AHEAD>
I've had both successful and not-so-successful times with 2.4.x's OOM killer.
Having looked at the oom_kill.c code, from my newbie point of view it _seems_
that, in theory, we try to kill a process too late (that is,
out_of_memory() reports OOM only when swap is full AND free memory is
down to freepages.min or fewer 4 KB pages).
I've seen some applications (e.g. Mozilla) allocate more memory when told
to exit normally. Once we are OOM, only freepages.min pages are free;
AFAIK that is the first number in /proc/sys/vm/freepages, or 512 KB worth
of memory. Maybe this is not enough in some situations, and that could
cause the machine to slow down badly while trying to kill a process that
itself needs free memory to exit, when in fact there is no free memory at all.
</BEWARE>
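The OOM condition described above can be written down as a small model. This is a sketch under stated assumptions, not the kernel's code: it encodes only the two checks mentioned in the post (swap exhausted AND free pages at or below freepages.min) plus the pages-to-KB conversion for 4 KB x86 pages. The real out_of_memory() in oom_kill.c may check more than this, and the helper names freepages_min_kb and looks_oom are hypothetical.

```python
# Simplified model of the OOM test described above (an assumption, not the
# kernel's actual out_of_memory()): swap must be exhausted AND free memory
# must be at or below freepages.min.

PAGE_SIZE_KB = 4  # x86 page size, as assumed in the post


def freepages_min_kb(freepages_text: str) -> int:
    """Take the text of /proc/sys/vm/freepages ("min low high", in pages)
    and return freepages.min converted to kilobytes."""
    min_pages = int(freepages_text.split()[0])
    return min_pages * PAGE_SIZE_KB


def looks_oom(free_pages: int, freepages_min: int, free_swap_pages: int) -> bool:
    """Report OOM only when no swap is left and free pages <= freepages.min."""
    return free_swap_pages == 0 and free_pages <= freepages_min


if __name__ == "__main__":
    # Hypothetical file contents: only the first field (128 pages) follows
    # from the 512 KB figure in the post; "256 384" are placeholders.
    sample = "128 256 384"
    print(freepages_min_kb(sample))  # 512
    print(looks_oom(free_pages=128, freepages_min=128, free_swap_pages=0))   # True
    print(looks_oom(free_pages=128, freepages_min=128, free_swap_pages=10))  # False
```

128 pages of 4 KB each gives the 512 KB mentioned above, and that is all the headroom left at the moment the model declares OOM, which is the point of the "too late" worry.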
Another interesting thing I noticed (as Robert Love's message shows) is
that oom_kill() seems to kill processes without taking into account
whether the selected process is a whole application or just one of
several "threads" of some application. This happened to me when
OpenOffice went crazy and OOM hit: instead of killing the parent process,
it killed only one of the children, and although the OOM killer recovered
memory, OpenOffice was left useless. Maybe the OOM killer should have
killed the parent in the first place.
Final question: a 2.4.4 kernel with no swap activated that hits OOM
(thanks to a purposely executed ls ../*/../*/..) takes much more time to
recover than the same setup with swap activated (exact numbers missing,
sorry). Moreover, when swap is off, the hard disk goes crazy as if it
were using swap, when in fact it isn't. Is this expected behaviour?
If someone wants some tests with real numbers, please let me know;
though I'm on vacation, I'll go in to work to run some tests :)
Regards.
--
José Luis Domingo López
Linux Registered User #189436 Debian GNU/Linux Potato (P166 64 MB RAM)
jdomingo EN internautas PUNTO org => Spam? Face the consequences
jdomingo AT internautas DOT org => Spam at your own risk
On 08 Jul 2001 00:40:51 +0000, José Luis Domingo López wrote:
> <snip>
> Another interesting thing I noticed (as Robert Love's message shows) is
> that oom_kill() seems to kill processes without taking into account
> whether the selected process is a whole application or just one of
> several "threads" of some application. This happened to me when
> OpenOffice went crazy and OOM hit: instead of killing the parent process,
> it killed only one of the children, and although the OOM killer recovered
> memory, OpenOffice was left useless. Maybe the OOM killer should have
> killed the parent in the first place.
for whatever reason, i did not even notice this. i guess that's because
Evolution itself exited, for some reason (normally if a single component
dies, say mail, it just puts up a dialog saying `mail component died').
i think there may be problems determining what the parent app is, or
whether there is a parent app at all. killing the PPID may not always be
the answer (but in many cases, like the one you gave, it is a very good answer).
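To make the trade-off concrete, here is a hypothetical heuristic (not something oom_kill.c does) for picking the "application" rather than a single process: starting from the chosen victim, walk up the parent chain while the parent runs the same command name, and stop at init or at a parent running a different program, such as the shell that started the application. The process table, PIDs, and names are made up for illustration; the layout loosely assumes that each thread shows up under its own PID, as the several evolution-mail PIDs in the log above suggest.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    ppid: int
    comm: str  # command name, as printed in the OOM message


def application_root(procs: dict[int, Proc], victim: int) -> int:
    """Hypothetical heuristic: climb from the victim towards init while the
    parent runs the same command, and return the topmost such PID."""
    cur = procs[victim]
    # Stop at init (PID 1) or when the parent is missing from the table.
    while cur.ppid != 1 and cur.ppid in procs:
        parent = procs[cur.ppid]
        if parent.comm != cur.comm:
            break  # parent is a different program (e.g. the shell): keep it
        cur = parent
    return cur.pid


if __name__ == "__main__":
    # Made-up process table: bash started "app", which has a helper that
    # in turn spawned a worker; all three "app" entries have their own PID.
    procs = {
        1: Proc(1, 0, "init"),
        100: Proc(100, 1, "bash"),
        200: Proc(200, 100, "app"),
        201: Proc(201, 200, "app"),
        202: Proc(202, 201, "app"),
    }
    print(application_root(procs, 202))  # 200: the whole app, not bash
    print(application_root(procs, 100))  # 100: bash's parent is init
```

The walk refuses to climb into bash, which is exactly the case where blindly killing the PPID would be wrong. Matching on the command name is itself a guess and would fail for programs whose helpers run under different names.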
> Final question: a 2.4.4 kernel with no swap activated that hits OOM
> (thanks to a purposely executed ls ../*/../*/..) takes much more time to
> recover than the same setup with swap activated (exact numbers missing,
> sorry). Moreover, when swap is off, the hard disk goes crazy as if it
> were using swap, when in fact it isn't. Is this expected behaviour?
i think i recall hearing about this, and the reply was something to the
effect of `it's known but not wanted'.
> If someone wants some tests with real numbers, please let me know;
> though I'm on vacation, I'll go in to work to run some tests :)
i forgot to give any stats from my incident. i couldn't access the
console (the machine was almost locked up, the mouse barely moved), so i
don't have any hard numbers.
from my GNOME applets <g> i saw that load was approaching 10, memory use
was at (or close to) 100%, and swap use was growing toward 100%.
this is kernel 2.4.6-ac2, x86, with 256MB memory, 768MB swap.
after the incident, memory use was down to the bare minimum, with only
30MB of cache, and swap use was only about 20MB.
i restarted X but not the system, and all is well.
--
Robert M. Love
rml at ufl.edu
rml at tech9.net
> Moreover, when swap is off, the hard disk
> goes crazy as if it were using swap, when in fact it isn't. Is this
> expected behaviour?
Yes. It's recovering memory by dropping program text pages (memory
mapped from ELF files), so those have to be reloaded from disk when the
program executes them again.
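A toy model of why the disk keeps working without swap: anonymous memory cannot be paged out when there is no swap, so reclaim can only take frames from file-backed pages such as program text. If a program keeps cycling through more text pages than the frames left over, every access under LRU replacement becomes a fault that rereads the page from the ELF file. This is a simplified LRU simulation, not the 2.4 VM's actual replacement policy, and text_page_reloads is a hypothetical name.

```python
from __future__ import annotations

from collections import OrderedDict


def text_page_reloads(text_accesses: list[str], frames_for_text: int) -> int:
    """Count disk reads for program-text pages under LRU replacement when
    only `frames_for_text` frames are left (the rest is anonymous memory
    that cannot be paged out because there is no swap)."""
    if frames_for_text < 1:
        raise ValueError("need at least one frame for program text")
    resident = OrderedDict()  # page -> None, least recently used first
    reads = 0
    for page in text_accesses:
        if page in resident:
            resident.move_to_end(page)  # hit: mark as most recently used
            continue
        reads += 1  # fault: reread the page from the ELF file
        if len(resident) >= frames_for_text:
            resident.popitem(last=False)  # drop the least recently used page
        resident[page] = None
    return reads


if __name__ == "__main__":
    loop = list("ABCD") * 3  # program repeatedly runs through 4 text pages
    print(text_page_reloads(loop, 4))  # 4: only the first pass faults
    print(text_page_reloads(loop, 3))  # 12: every access faults
```

With four frames the loop faults only on its first pass; with three, LRU evicts each page just before it is needed again, so all 12 accesses go to disk. Losing a single frame to pinned anonymous memory is enough to turn a quiet workload into constant disk activity.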
--
Daniel