2003-02-23 15:55:59

by David Mansfield

Subject: oom killer and its superior braindamage in 2.4


Marc, Rik,

> - Feb 21 10:04:57 codeman kernel: Out of Memory: Killed process 2657
> (apache).
>
> The above log entry (apache) appeared every few seconds for about 4 hours
> (same PID) until I thought about sysrq-b to get out of this
> braindead behaviour. The machine was somewhat dead for me because I was
> not able to do anything but sysrq. The system itself was _not_ dead,
> there was massive disk i/o. This is 2.4.20 vanilla.

This exact thing happened to me as well, on a 2.4.20-pre that hasn't been
upgraded to 2.4.20 yet. The thing that concerns me most is:

Why won't the system kill the process it claims to be killing?

If, in Marc's case, the system wants to kill PID 2657, a lowly sleeping
apache process, why can't it? This is a bug for sure.

For me, there was some python process chosen as the one for killing and it
repeated the 'Out of Memory: Killed process xxxxx (python)' for hours
while making no progress. The machine was still routing packets but I
couldn't log in. Sys-rq was disabled, so I was forced to use the big red
button.

Rik, any ideas?

David

--
/==============================\
| David Mansfield |
| [email protected] |
\==============================/


2003-02-23 16:15:00

by Faik Uygur

Subject: Re: oom killer and its superior braindamage in 2.4

> This exact thing happened to me as well, on a 2.4.20-pre that hasn't been
> upgraded to 2.4.20 yet. The thing that concerns me most is:
>
> Why won't the system kill the process it claims to be killing?
>
> If, in Marc's case, the system wants to kill PID 2657, a lowly sleeping
> apache process, why can't it? This is a bug for sure.
>
> For me, there was some python process chosen as the one for killing and it
> repeated the 'Out of Memory: Killed process xxxxx (python)' for hours
> while making no progress. The machine was still routing packets but I
> couldn't log in. Sys-rq was disabled, so I was forced to use the big red
> button.
>
> Rik, any ideas?

But did you follow that thread? Rik van Riel already suggested a solution for
the problem.

http://marc.theaimsgroup.com/?l=linux-kernel&m=104594301523518&w=2

2003-02-23 16:15:57

by Rik van Riel

Subject: Re: oom killer and its superior braindamage in 2.4

On Sun, 23 Feb 2003, David Mansfield wrote:

> Rik, any ideas?

You could try the patch I sent to Marc and linux-kernel
yesterday afternoon ;)

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-23 17:58:11

by David Mansfield

Subject: Re: oom killer and its superior braindamage in 2.4



> On Sun, 23 Feb 2003, David Mansfield wrote:
>
> > Rik, any ideas?
>
> You could try the patch I sent to Marc and linux-kernel
> yesterday afternoon ;)
>

You miss my point completely. The kernel has ALREADY chosen a task to
kill. I don't care to adjust the 'badness' function. The kernel has
already chosen a bad task.

If you read my post, the bug is that the kernel CANNOT kill that process.
Why? If it's really a bad process, shouldn't it be the one that gets
killed?

With your patch we have:

1) Kernel goes OOM
2) Kernel picks the worst task to kill using badness()
3) Kernel attempts to kill this task but fails due to some {reason|bug}.
4) Kernel now picks some other task to kill even though the 'baddest' one
is allowed to hang out.

This is my question, and I don't see how the patch addresses it.

David



--
/==============================\
| David Mansfield |
| [email protected] |
\==============================/

2003-02-23 20:05:14

by Rik van Riel

Subject: Re: oom killer and its superior braindamage in 2.4

On Sun, 23 Feb 2003, David Mansfield wrote:

> If you read my post, the bug is that the kernel CANNOT kill that
> process. Why? If it's really a bad process, shouldn't it be the one
> that gets killed?

> This is my question, and I don't see how the patch addresses it.

And you won't see one, either. You cannot change the
semantics of uninterruptible sleep, nor can the OOM
killer change other device driver things.

This means the OOM killer has little choice but to
"hope for the best" and pick another process if the
first process chosen can't exit.
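
The fallback described here can be sketched as a small user-space simulation. This is not the kernel's code and not the patch under discussion: the Task class, the can_exit flag and the badness numbers are all invented for illustration, and only the two PIDs come from the log lines quoted in this thread.

```python
from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    name: str
    badness: int    # hypothetical score; higher means a better kill candidate
    can_exit: bool  # False models a task stuck in uninterruptible sleep


def pick_victims(tasks):
    """Return the tasks signalled, in order, until one of them can exit."""
    signalled = []
    # Walk the candidates from highest to lowest badness score.
    for task in sorted(tasks, key=lambda t: t.badness, reverse=True):
        signalled.append(task)
        if task.can_exit:
            break  # this task exits and frees memory, so stop here
        # Otherwise the task cannot act on the signal; try the next one.
    return signalled


tasks = [
    Task(2657, "apache", badness=90, can_exit=False),  # scores are made up
    Task(1463, "mysqld", badness=80, can_exit=True),
    Task(900, "sshd", badness=5, can_exit=True),       # hypothetical task
]
victims = pick_victims(tasks)
print([t.pid for t in victims])  # [2657, 1463]
```

Without the fallback, the loop would keep returning to the first entry, which matches the repeated "Killed process 2657" messages reported earlier in the thread.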

If you think you can fix all drivers to work fine
when tasks suddenly disappear, I guess you might
want to create such a patch ...

regards,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-23 20:12:10

by David Mansfield

Subject: Re: oom killer and its superior braindamage in 2.4

>
> > If you read my post, the bug is that the kernel CANNOT kill that
> > process. Why? If it's really a bad process, shouldn't it be the one
> > that gets killed?
>
> > This is my question, and I don't see how the patch addresses it.
>
> And you won't see one, either. You cannot change the
> semantics of uninterruptible sleep, nor can the OOM
> killer change other device driver things.

So you're saying that a process can stay in the D state, without ever
getting enough resources to complete a single uninterruptible wait, for
hours at a time?

Ok. Now I understand your patch. Thanks for the info.

You should push your patch to Marcelo.

Thanks,
David

--
/==============================\
| David Mansfield |
| [email protected] |
\==============================/

2003-02-23 20:44:13

by Rik van Riel

Subject: Re: oom killer and its superior braindamage in 2.4

On Sun, 23 Feb 2003, David Mansfield wrote:

> So you're saying that a process can stay in the D state, without ever
> getting enough resources to complete a single uninterruptible wait, for
> hours at a time?

Or even in the R state, but that would only happen when there
is a kernel bug. The OOM killer can do nothing but hope for
the best and try another process if the first one doesn't want
to exit.
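
For anyone wanting to check which state a chosen task is actually sitting in, a minimal sketch follows. It assumes the usual /proc/<pid>/stat layout of "pid (comm) state ..."; the helper name task_state and the sample line are made up for illustration, and the sample is truncated after a few fields.

```python
def task_state(stat_line):
    """Extract the one-letter state from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the last closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


# Invented, truncated sample line; a real one has many more fields.
sample = "2657 (apache) D 1 2657 2657 0 -1 64 0 0 0 0"
print(task_state(sample))  # D
```

On a live system the line would come from reading /proc/<pid>/stat for the PID named in the "Killed process" message; D means uninterruptible sleep and R means runnable.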

> Ok. Now I understand your patch. Thanks for the info.
>
> You should push your patch to Marcelo.

Will do.

cheers,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-24 09:03:33

by Mikael Starvik

Subject: RE: oom killer and its superior braindamage in 2.4

Does everyone agree that killing a process is always the best approach
to resolve an OOM? If the OOM is caused by e.g. a growing tmpfs or
memory leaks in the kernel it won't help much to kill processes that
may respawn.

Would it be useful if it was possible to register another oom-handler?
Some architectures could then choose to e.g. reboot the system instead.

/Mikael

-----Original Message-----
From: [email protected]
[mailto:[email protected]]On Behalf Of Marc-Christian
Petersen
Sent: Saturday, February 22, 2003 8:35 PM
To: [email protected]
Subject: oom killer and its superior braindamage in 2.4


Hi all,

I just thought (ok it was yesterday) about stress testing my mysql db.
I used this:
- mystress.pl localhost mysql root test 600 300 60 "select * from user"

It worked like a charm. So I tried:
- mystress.pl localhost mysql root test 1800 900 60 "select * from user"

My machine has 512MB RAM and 512MB SWAP.

I expected that the 2nd run would OOM my machine but I did not expect this
silly behaviour.

The following log entry appeared only _once_ (there were ~700 mysqld processes running):

- Feb 21 10:03:22 codeman kernel: Out of Memory: Killed process 1463 (mysqld).


Instead of really killing either mysqld or mystress.pl the OOM killer decided
to kill apache (apache did nothing but had 5 threads sleeping)

- Feb 21 10:04:57 codeman kernel: Out of Memory: Killed process 2657 (apache).

The above log entry (apache) appeared every few seconds for about 4 hours
(same PID) until I thought about sysrq-b to get out of this braindead
behaviour. The machine was somewhat dead for me because I was not able to do
anything but sysrq. The system itself was _not_ dead, there was massive disk
i/o. This is 2.4.20 vanilla.

Is there any chance we can fix this up?

ciao, Marc


2003-02-24 09:51:04

by Marc-Christian Petersen

Subject: Re: oom killer and its superior braindamage in 2.4

On Monday 24 February 2003 10:13, Mikael Starvik wrote:

Hi Mikael,

> Does everyone agree that killing a process is always the best approach
> to resolve an OOM? If the OOM is caused by e.g. a growing tmpfs or
> memory leaks in the kernel it won't help much to kill processes that
> may respawn.
Well, I don't agree that it's always the best approach. Other bad things, as
you mentioned, can happen.

> Would it be useful if it was possible to register another oom-handler?
> Some architectures could then choose to e.g. reboot the system instead.
I'd like to see _an option_ (read: not default but an option, e.g. boot
parameter) that will reboot the machine after $specified_time if an OOM
killing action does not stop.
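
The detection half of such an option can be approximated from user space by watching the log. The sketch below is only an illustration under stated assumptions: the function name stuck_pid is made up, the regular expression is written against the log lines quoted in this thread, the year is passed in because syslog timestamps omit it, and the last two sample lines are invented repeats. It reports the PID; what to do next (reboot or otherwise) is left out.

```python
import re
from datetime import datetime, timedelta

# Pattern written against the "Out of Memory" lines quoted in this thread.
KILL_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"Out of Memory: Killed process (\d+) \((.+)\)\.$"
)


def stuck_pid(lines, limit, year=2003):
    """Return the first PID reported killed over a span longer than `limit`."""
    first_seen = {}
    for line in lines:
        m = KILL_RE.match(line.strip())
        if not m:
            continue
        # Syslog timestamps carry no year, so one is supplied (assumption).
        when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        pid = int(m.group(2))
        start = first_seen.setdefault(pid, when)
        if when - start > limit:
            return pid
    return None


sample_log = [
    "Feb 21 10:03:22 codeman kernel: Out of Memory: Killed process 1463 (mysqld).",
    "Feb 21 10:04:57 codeman kernel: Out of Memory: Killed process 2657 (apache).",
    # The two lines below are invented repeats, for illustration only.
    "Feb 21 10:05:02 codeman kernel: Out of Memory: Killed process 2657 (apache).",
    "Feb 21 10:20:00 codeman kernel: Out of Memory: Killed process 2657 (apache).",
]
print(stuck_pid(sample_log, timedelta(minutes=10)))  # 2657
```

A user-space watcher like this only helps if it can still get scheduled during the OOM condition, which is one reason to want the option in the kernel itself.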

ciao, Marc