2003-02-22 19:31:43

by Marc-Christian Petersen

Subject: oom killer and its superior braindamage in 2.4

Hi all,

I just thought (ok it was yesterday) about stress testing my mysql db.
I used this:
- mystress.pl localhost mysql root test 600 300 60 "select * from user"

It worked like a charm. So I tried:
- mystress.pl localhost mysql root test 1800 900 60 "select * from user"

My machine has 512MB RAM and 512MB SWAP.

I expected the 2nd run to OOM my machine, but I did not expect this
silly behaviour.

The following log entry appeared only _once_ (there were ~700 mysqld
processes running):

- Feb 21 10:03:22 codeman kernel: Out of Memory: Killed process 1463 (mysqld).


Instead of really killing either mysqld or mystress.pl the OOM killer decided
to kill apache (apache did nothing but had 5 threads sleeping)

- Feb 21 10:04:57 codeman kernel: Out of Memory: Killed process 2657 (apache).

The above log entry (apache) appeared every few seconds for about 4 hours
(same PID) until I thought of sysrq-b to get out of this braindead
behaviour. The machine was effectively dead for me because I was not able to
do anything but sysrq. The system itself was _not_ dead; there was massive
disk i/o. This is 2.4.20 vanilla.

Is there any chance we can fix this up?

ciao, Marc



2003-02-22 20:04:33

by Rik van Riel

Subject: Re: oom killer and its superior braindamage in 2.4

On Sat, 22 Feb 2003, Marc-Christian Petersen wrote:

> - Feb 21 10:04:57 codeman kernel: Out of Memory: Killed process 2657 (apache).
>
> The above log entry (apache) appeared every few seconds for about 4
> hours (same PID) until I thought of sysrq-b

> Is there any chance we can fix this up?

Yes.

1) add a VM_KILLED flag
2) set this flag on the p->mm->def_flags when you kill a
   process/thread from oom_kill.c
3) clear the flag on process exit
4) on a new call to oom_kill, skip processes when
   (p->mm->def_flags & VM_KILLED) by returning 0
   points for such a process
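
A minimal user-space sketch of the four steps above, not kernel code.
struct mm_model and struct task_model are hypothetical stand-ins for the
kernel's mm_struct and task_struct, the VM_KILLED bit value is made up
(the proposal gives none), and the score is simply the mapped size:

```c
#include <stddef.h>

/* Step 1: hypothetical flag bit; the proposal names VM_KILLED but no value. */
#define VM_KILLED 0x1UL

struct mm_model {                 /* stand-in for struct mm_struct */
    unsigned long def_flags;
    unsigned long total_vm;       /* size used as the score in this model */
};

struct task_model {               /* stand-in for struct task_struct */
    int pid;
    struct mm_model *mm;          /* NULL models a kernel thread */
};

/* Step 4: a task whose mm was already killed scores 0 points. */
static unsigned long badness_model(const struct task_model *p)
{
    if (!p->mm)
        return 0;
    if (p->mm->def_flags & VM_KILLED)
        return 0;
    return p->mm->total_vm;
}

/* Step 2: mark the mm when the task is chosen as victim. */
static void oom_kill_model(struct task_model *p)
{
    p->mm->def_flags |= VM_KILLED;
    /* the real code would deliver the signal here */
}

/* Step 3: clear the mark when the task exits. */
static void exit_model(struct task_model *p)
{
    if (p->mm)
        p->mm->def_flags &= ~VM_KILLED;
}

/* Pick the highest-scoring task; NULL if nothing scores above 0. */
static struct task_model *select_victim(struct task_model *tasks, size_t n)
{
    struct task_model *chosen = NULL;
    unsigned long max = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long points = badness_model(&tasks[i]);
        if (points > max) {
            chosen = &tasks[i];
            max = points;
        }
    }
    return chosen;
}
```

Calling select_victim() again after oom_kill_model() moves on to the next
candidate instead of returning the same task, because the flag sits on
the mm and so covers every thread sharing it.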

cheers,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-22 20:22:34

by Rik van Riel

Subject: Re: oom killer and its superior braindamage in 2.4

On Sat, 22 Feb 2003, Rik van Riel wrote:
> On Sat, 22 Feb 2003, Marc-Christian Petersen wrote:
>
> > - Feb 21 10:04:57 codeman kernel: Out of Memory: Killed process 2657 (apache).
> >
> > The above log entry (apache) appeared every few seconds for about 4
> > hours (same PID) until I thought of sysrq-b
>
> > Is there any chance we can fix this up?
>
> Yes.

Never mind my last idea, it can be done much simpler ;)

Does the below patch fix your problem ?

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/


===== mm/oom_kill.c 1.11 vs edited =====
--- 1.11/mm/oom_kill.c Fri Aug 16 10:59:46 2002
+++ edited/mm/oom_kill.c Sat Feb 22 17:31:49 2003
@@ -61,6 +61,9 @@
 
 	if (!p->mm)
 		return 0;
+
+	if (p->flags & PF_MEMDIE)
+		return 0;
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
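
The effect of the added check can be modelled in user space. This is a
sketch under assumptions, not the kernel's code: struct task_sketch,
badness_sketch() and run_oom_rounds() are hypothetical names, the flag
value is illustrative, the score is just the memory size, and the victim
is assumed never to exit between rounds (as in the report):

```c
#include <stddef.h>

/* Illustrative bit for the model; not taken from the patch. */
#define PF_MEMDIE_MODEL 0x1000UL

struct task_sketch {
    int pid;
    unsigned long flags;
    unsigned long total_vm;       /* 0 models a task without an mm */
};

/* Score a task; with `patched` set, a task already marked scores 0. */
static unsigned long badness_sketch(const struct task_sketch *p, int patched)
{
    if (p->total_vm == 0)
        return 0;
    if (patched && (p->flags & PF_MEMDIE_MODEL))
        return 0;
    return p->total_vm;
}

/*
 * Simulate up to `rounds` OOM events in which no victim ever exits.
 * Records the PID picked in each round and returns how many were picked.
 */
static size_t run_oom_rounds(struct task_sketch *tasks, size_t n, int patched,
                             int *picked, size_t rounds)
{
    size_t count = 0;

    for (size_t r = 0; r < rounds; r++) {
        struct task_sketch *chosen = NULL;
        unsigned long max = 0;

        for (size_t i = 0; i < n; i++) {
            unsigned long points = badness_sketch(&tasks[i], patched);
            if (points > max) {
                chosen = &tasks[i];
                max = points;
            }
        }
        if (!chosen)
            break;                /* nothing left to pick */
        chosen->flags |= PF_MEMDIE_MODEL;
        picked[count++] = chosen->pid;
    }
    return count;
}
```

With patched == 0 the same PID is recorded in every round, which mirrors
the repeated log line; with patched != 0 each PID is recorded at most
once.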

2003-02-23 17:27:48

by Marc-Christian Petersen

Subject: Re: oom killer and its superior braindamage in 2.4

On Saturday 22 February 2003 21:32, Rik van Riel wrote:

Hi Rik,

> > > - Feb 21 10:04:57 codeman kernel: Out of Memory: Killed process 2657 (apache).
> > > The above log entry (apache) appeared every few seconds for about 4
> > > hours (same PID) until I thought of sysrq-b
> > > Is there any chance we can fix this up?
> > Yes.
> Never mind my last idea, it can be done much simpler ;)
hehe :)

> Does the below patch fix your problem ?
Well, this makes a difference. I filled up my memory with something else
before starting mystress.pl because top and ps are slow with many processes.
I had about 400 processes. The test from yesterday had ~1800.

With your patch, mystress.pl was marked to get killed, every PID only once, no
apache or similar (good). ... But the strange thing is, that it seems none of
the processes, which are marked to be killed, get killed. So sysrq-t tells
me. Sysrq-i gave me the chance to get out of the OOM killing process and only
kernel threads were left + getty's so I was able to log in again.

ciao, Marc

2003-02-23 20:08:25

by Rik van Riel

Subject: Re: oom killer and its superior braindamage in 2.4

On Sun, 23 Feb 2003, Marc-Christian Petersen wrote:

> > Does the below patch fix your problem ?

> With your patch, mystress.pl was marked to get killed, every PID only
> once, no apache or similar (good). ... But the strange thing is, that it
> seems none of the processes, which are marked to be killed, get killed.
> So sysrq-t tells me.

It'd be interesting to know where these processes are spending
their CPU time and why they're not catching their signals.
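
One thing that can be checked from user space is the scheduler state of
the marked processes, since a task in uninterruptible sleep (state D)
does not act on SIGKILL until it wakes up. A small sketch, assuming the
usual /proc/<pid>/stat layout where the state character follows the
parenthesised command name; stat_state() is a hypothetical helper, not
an existing API:

```c
#include <string.h>

/*
 * Return the state character (R, S, D, Z, T, ...) from one line of
 * /proc/<pid>/stat, or '\0' if the line is malformed. The command name
 * may itself contain spaces or parentheses, so search for the LAST ')'.
 */
static char stat_state(const char *line)
{
    const char *close = strrchr(line, ')');

    if (!close || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}
```

Feeding it the stat line of each process the OOM killer marked would
show whether they sit in D state rather than burning CPU.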

> Sysrq-i gave me the chance to get out of the OOM killing process and
> only kernel threads were left + getty's so I was able to log in again.

Strange, so sysrq-i manages to kill the processes, but the OOM
killer doesn't kill the processes ?

This is very suspect because the OOM killer uses force_sig in
the same way the sysrq-i handler does...

regards,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-23 20:21:15

by Marc-Christian Petersen

Subject: Re: oom killer and its superior braindamage in 2.4

On Sunday 23 February 2003 21:18, Rik van Riel wrote:

Hi Rik,

> It'd be interesting to know where these processes are spending
> their CPU time and why they're not catching their signals.
I'll look into it again when I do the next run.

> > Sysrq-i gave me the chance to get out of the OOM killing process and
> > only kernel threads were left + getty's so I was able to log in again.
> Strange, so sysrq-i manages to kill the processes, but the OOM
> killer doesn't kill the processes ?
yep, so it is.

> This is very suspect because the OOM killer uses force_sig in
> the same way the sysrq-i handler does...
indeed. Well, sysrq-i needs about 5 seconds to give me my getty back.

Anyway, your patch should go into -BK. Your patch does _not_ introduce this
behaviour, it's present even w/o your patch but your approach makes things
better :)

ciao, Marc