2005-03-08 15:21:35

by Christian Kujau

[permalink] [raw]
Subject: oom with 2.6.11

hallo list,

today my machine went out out memory and noticing it several hours after
the first OOM message in the log, i wonder
1) why this happened at all and
2) why almost every service was killed despite the clever algorithms
documented in mm/oom_kill.c.

the first oom message went to the syslog at 01:27, i was away and no heavy
tasks were scheduled:

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt

mysqld got killed by the oom killer, so i have to suspect mysql for being
the reason for oom here, even that i know that mysqld is running all day
long. several other tasks got killed, but "Free swap" stays at 0kB and the
oom killer kills almost every other tasks, with no success in freeing ram.

the log stops at 03:21, perhaps syslog-ng got killed.
at around 07:31 i noticed the mess, did SYSRQ-E and now i was able to
login again. i pressed SYSRQ-M/T/P too, they are all in the log. at this
time loadavg was at 249 ;)

i went to runlevel 2, then up again to 3 and all services are up and
running again.

some 2.6.11-rc3 BK snapshot was running pretty stable (no OOM) for ~30
days before i switched to 2.6.11 (vanilla) a few days ago. i have to (not)
reproduce the problem the next night, i wonder if it will happen again.

do you vm-gurus have any idea to the points asked above?

more infos about the box here: http://nerdbynature.de/bits/sheep/2.6.11/oom/


thank you for your comments,
Christian.
--
BOFH excuse #281:

The co-locator cannot verify the frame-relay gateway to the ISDN server.


2005-03-09 13:19:31

by Mauricio Lin

[permalink] [raw]
Subject: Re: oom with 2.6.11

Hi Christian,

Could you check the mm/oom_kill.c for your kernel 2.6.11-rc3?

During the 2.6.11-rc development, the oom killer was changed by Andrea
Arcangeli. I do not remember exactly which was the version that this
modification was included, perhaps in kernel 2.6.11-rc4.

Now this oom killer modification is part of 2.6.11 vanilla kernel.

Send the mm/oom_kill.c of 2.6.11-rc3 to me, please. Let me confirm my doubt.

BR,

Mauricio Lin.

On Tue, 08 Mar 2005 16:21:21 +0100, Christian Kujau <[email protected]> wrote:
> hallo list,
>
> today my machine went out out memory and noticing it several hours after
> the first OOM message in the log, i wonder
> 1) why this happened at all and
> 2) why almost every service was killed despite the clever algorithms
> documented in mm/oom_kill.c.
>
> the first oom message went to the syslog at 01:27, i was away and no heavy
> tasks were scheduled:
>
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
>
> mysqld got killed by the oom killer, so i have to suspect mysql for being
> the reason for oom here, even that i know that mysqld is running all day
> long. several other tasks got killed, but "Free swap" stays at 0kB and the
> oom killer kills almost every other tasks, with no success in freeing ram.
>
> the log stops at 03:21, perhaps syslog-ng got killed.
> at around 07:31 i noticed the mess, did SYSRQ-E and now i was able to
> login again. i pressed SYSRQ-M/T/P too, they are all in the log. at this
> time loadavg was at 249 ;)
>
> i went to runlevel 2, then up again to 3 and all services are up and
> running again.
>
> some 2.6.11-rc3 BK snapshot was running pretty stable (no OOM) for ~30
> days before i switched to 2.6.11 (vanilla) a few days ago. i have to (not)
> reproduce the problem the next night, i wonder if it will happen again.
>
> do you vm-gurus have any idea to the points asked above?
>
> more infos about the box here: http://nerdbynature.de/bits/sheep/2.6.11/oom/
>
> thank you for your comments,
> Christian.
> --
> BOFH excuse #281:
>
> The co-locator cannot verify the frame-relay gateway to the ISDN server.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2005-03-09 13:41:14

by Mauricio Lin

[permalink] [raw]
Subject: Re: oom with 2.6.11

Hi Christian,

I found the 2.6.11-rc3 patch. The oom killer modification from
Arcangeli was included in 2.6.11-rc3. Right? So this is correct, so
the problem is not related to Arcangeli modification.

Does anyone have idea?

BR,

Mauricio Lin.

On Wed, 9 Mar 2005 09:18:31 -0400, Mauricio Lin <[email protected]> wrote:
> Hi Christian,
>
> Could you check the mm/oom_kill.c for your kernel 2.6.11-rc3?
>
> During the 2.6.11-rc development, the oom killer was changed by Andrea
> Arcangeli. I do not remember exactly which was the version that this
> modification was included, perhaps in kernel 2.6.11-rc4.
>
> Now this oom killer modification is part of 2.6.11 vanilla kernel.
>
> Send the mm/oom_kill.c of 2.6.11-rc3 to me, please. Let me confirm my doubt.
>
> BR,
>
> Mauricio Lin.
>
> On Tue, 08 Mar 2005 16:21:21 +0100, Christian Kujau <[email protected]> wrote:
> > hallo list,
> >
> > today my machine went out out memory and noticing it several hours after
> > the first OOM message in the log, i wonder
> > 1) why this happened at all and
> > 2) why almost every service was killed despite the clever algorithms
> > documented in mm/oom_kill.c.
> >
> > the first oom message went to the syslog at 01:27, i was away and no heavy
> > tasks were scheduled:
> >
> > http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
> >
> > mysqld got killed by the oom killer, so i have to suspect mysql for being
> > the reason for oom here, even that i know that mysqld is running all day
> > long. several other tasks got killed, but "Free swap" stays at 0kB and the
> > oom killer kills almost every other tasks, with no success in freeing ram.
> >
> > the log stops at 03:21, perhaps syslog-ng got killed.
> > at around 07:31 i noticed the mess, did SYSRQ-E and now i was able to
> > login again. i pressed SYSRQ-M/T/P too, they are all in the log. at this
> > time loadavg was at 249 ;)
> >
> > i went to runlevel 2, then up again to 3 and all services are up and
> > running again.
> >
> > some 2.6.11-rc3 BK snapshot was running pretty stable (no OOM) for ~30
> > days before i switched to 2.6.11 (vanilla) a few days ago. i have to (not)
> > reproduce the problem the next night, i wonder if it will happen again.
> >
> > do you vm-gurus have any idea to the points asked above?
> >
> > more infos about the box here: http://nerdbynature.de/bits/sheep/2.6.11/oom/
> >
> > thank you for your comments,
> > Christian.
> > --
> > BOFH excuse #281:
> >
> > The co-locator cannot verify the frame-relay gateway to the ISDN server.
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
>

2005-03-09 14:00:20

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

Mauricio Lin wrote:
> Hi Christian,
>
> I found the 2.6.11-rc3 patch. The oom killer modification from
> Arcangeli was included in 2.6.11-rc3. Right? So this is correct, so
> the problem is not related to Arcangeli modification.
>
> Does anyone have idea?

hi Mauricio,

thank you for your answers, but i'm unable to investigate right now, will
do so in the evening.

thanks,
Christian.
--
BOFH excuse #240:

Too many little pins on CPU confusing it, bend back and forth until 10-20%
are neatly removed. Do _not_ leave metal bits visible!

2005-03-10 15:12:38

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

ok,

as "promised", it the OOM happened again with the same plain 2.6.11,
details here.

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11_2.txt

the following is a quite long, but please read on
(if anyone is reading at all :))

this time it happened at 08:01, and i could image some heavy cron jobs
were going on. but as i said: "it did not happen before". there are also
output of SYSRQ-T/M/P. i did SYSRQ-E to recover the machine, but then
decided to reboot back to 2.6.11-rc5-bk2.

i had a look at the changelogs too and noticed that ChangeLog-2.6.11
contains 7 occurrences of "OOM" in the patch desctiption:

[PATCH] mm: overcommit updates, 2005-01-03
[PATCH] vmscan: count writeback pages in nr_scanned, 2005-01-08
[PATCH] possible rq starvation on oom, 2005-01-13
[PATCH] mm: adjust dirty threshold for lowmem-only mappings, 2005-01-25
[PATCH] mm: oom-killer tunable, 2005-02-02
[PATCH] mm: fix several oom killer bugs, 2005-02-02
[PATCH] Fix oops in alloc_zeroed_user_highpage() when [...],2005-02-09

release dates:
2.6.11-rc5-bk1 26-Feb-2005
2.6.11-rc5-bk2 27-Feb-2005 <
2.6.11-rc5-bk3 28-Feb-2005
2.6.11-rc5-bk4 01-Mar-2005
2.6.11 02-Mar-2005

so i really don't see any patches that *could* have something to do with
the issue here.

now comes the weird part:

i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
compiling went fine. ok, finished some email, ok, suddenly my swap was
used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!

to summarize it:
i've run 2.6.11-rc2-bk10 during whole february, then switched to
2.6.11-rc5-bk2 on 28.02.2005, then to 2.6.11 on 05.03.2005 - and only
noticed with 2.6.11 first, now with 2.6.11-rc5-bk2 too.

there is an interesting part in the logfiles:

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11_2.txt
http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11-rc5-bk2.txt

every last message before the "OOM" messages is something with pppd:

Mar 10 13:45:55 sheep pppd[1567]: Starting link
Mar 10 14:12:29 sheep kernel: oom-killer: gfp_mask=0x1d2

Mar 8 00:59:58 sheep pppd[418]: Starting link
Mar 8 01:27:33 sheep kernel: oom-killer: gfp_mask=0xd0

Mar 9 07:33:49 sheep pppd[30937]: Starting link
Mar 9 08:01:35 sheep kernel: oom-killer: gfp_mask=0x1d2

and 30min later OOM kicks in. normally, pppd (pppoe) gives messages like this:

Mar 10 14:23:38 sheep pppd[26365]: Starting link
Mar 10 14:23:38 sheep pppd[26365]: Serial connection established.
Mar 10 14:23:38 sheep pppd[26365]: Connect: ppp0 <--> /dev/pts/0
Mar 10 14:23:38 sheep pppoe[26383]: PADS: Service-Name: ''
Mar 10 14:23:38 sheep pppoe[26383]: PPP session is 6804
Mar 10 14:23:39 sheep pppd[26365]: CHAP authentication succeeded
Mar 10 14:23:40 sheep pppd[26365]: Local IP address changed to
[...]

is this strange? or not?

i hope someone has a hint for me, because "going back to the stable
kernel" would mean "being bound to 2.6.11-rc2-bk10" :(

thank you for any hints,
Christian.

PS: Steven, i've cc'ed you because you have trouble with new 2.6.11
kernels and pppd too. maybe unrelated, maybe not.
--
BOFH excuse #185:

system consumed all the paper for paging

2005-03-11 00:40:38

by Andrew Morton

[permalink] [raw]
Subject: Re: oom with 2.6.11

Christian Kujau <[email protected]> wrote:
>
> i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
> compiling went fine. ok, finished some email, ok, suddenly my swap was
> used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!

Well if you ran out of swap then yes, the oom-killer will visit you.

Why did you run out of swapspace?

2005-03-11 01:15:57

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

Andrew Morton wrote:
> Christian Kujau <[email protected]> wrote:
>
>>i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
>>compiling went fine. ok, finished some email, ok, suddenly my swap was
>>used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!
>
>
> Well if you ran out of swap then yes, the oom-killer will visit you.
>
> Why did you run out of swapspace?

hm, if i only knew. i don't know how long it took the other night to go
from "normal" to "OOM". but today, with 2.6.11-rc5-bk2 (well, yesterday
actually) i was working normally, and all of a sudden swap goes from 170MB
used swap (normal) to OOM. i think it took a minute or so, but i just
can't tell which application went nuts. today the first process that got
killed was "ssh-agent", the other day it was mysqld. but even after this,
it should've released some memory, right? but the oom-killer goes on and
on and kills the next task.

i'll monitor memory usage tonight and see what it gives. these "pppd"
messages are suspicious though.

thank you,
Christian.
--
BOFH excuse #135:

You put the disk in upside down.

2005-03-11 07:45:40

by Pasi Kärkkäinen

[permalink] [raw]
Subject: Re: oom with 2.6.11

On Fri, Mar 11, 2005 at 02:14:14AM +0100, Christian Kujau wrote:
> Andrew Morton wrote:
> > Christian Kujau <[email protected]> wrote:
> >
> >>i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
> >>compiling went fine. ok, finished some email, ok, suddenly my swap was
> >>used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!
> >
> >
> > Well if you ran out of swap then yes, the oom-killer will visit you.
> >
> > Why did you run out of swapspace?
>
> hm, if i only knew. i don't know how long it took the other night to go
> from "normal" to "OOM". but today, with 2.6.11-rc5-bk2 (well, yesterday
> actually) i was working normally, and all of a sudden swap goes from 170MB
> used swap (normal) to OOM. i think it took a minute or so, but i just
> can't tell which application went nuts. today the first process that got
> killed was "ssh-agent", the other day it was mysqld. but even after this,
> it should've released some memory, right? but the oom-killer goes on and
> on and kills the next task.
>
> i'll monitor memory usage tonight and see what it gives. these "pppd"
> messages are suspicious though.
>

I've also seen this 3 times now.. I'm running Xen 2.0 and the 2.6.10-xen0
kernel (with -as5 patch) goes OOM after a couple of days operation.. and
then the box reboots.

The "virtual machines" are OK, it's only the dom0 kernel that goes OOM..
And I'm not running anything special on dom0, only xen control stuff
(which is written in python..), ntp, nfs server, ssh, lvm2 and software raid.

Now I'm running a script which logs the cpu/memory/swap usage every 1
minutes.. trying to see if I can find the cause for the OOM.

-- Pasi K?rkk?inen

^
. .
Linux
/ - \
Choice.of.the
.Next.Generation.

2005-03-11 09:01:18

by Mauricio Lin

[permalink] [raw]
Subject: Re: oom with 2.6.11

Hi Christian,

I would like to know what are the kernel versions this problem happened.

Did this problem start from 2.6.11-rc2-bk10?

BR,

Mauricio Lin.

On Thu, 10 Mar 2005 16:12:27 +0100, Christian Kujau <[email protected]> wrote:
> ok,
>
> as "promised", it the OOM happened again with the same plain 2.6.11,
> details here.
>
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11_2.txt
>
> the following is a quite long, but please read on
> (if anyone is reading at all :))
>
> this time it happened at 08:01, and i could image some heavy cron jobs
> were going on. but as i said: "it did not happen before". there are also
> output of SYSRQ-T/M/P. i did SYSRQ-E to recover the machine, but then
> decided to reboot back to 2.6.11-rc5-bk2.
>
> i had a look at the changelogs too and noticed that ChangeLog-2.6.11
> contains 7 occurrences of "OOM" in the patch desctiption:
>
> [PATCH] mm: overcommit updates, 2005-01-03
> [PATCH] vmscan: count writeback pages in nr_scanned, 2005-01-08
> [PATCH] possible rq starvation on oom, 2005-01-13
> [PATCH] mm: adjust dirty threshold for lowmem-only mappings, 2005-01-25
> [PATCH] mm: oom-killer tunable, 2005-02-02
> [PATCH] mm: fix several oom killer bugs, 2005-02-02
> [PATCH] Fix oops in alloc_zeroed_user_highpage() when [...],2005-02-09
>
> release dates:
> 2.6.11-rc5-bk1 26-Feb-2005
> 2.6.11-rc5-bk2 27-Feb-2005 <
> 2.6.11-rc5-bk3 28-Feb-2005
> 2.6.11-rc5-bk4 01-Mar-2005
> 2.6.11 02-Mar-2005
>
> so i really don't see any patches that *could* have something to do with
> the issue here.
>
> now comes the weird part:
>
> i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
> compiling went fine. ok, finished some email, ok, suddenly my swap was
> used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!
>
> to summarize it:
> i've run 2.6.11-rc2-bk10 during whole february, then switched to
> 2.6.11-rc5-bk2 on 28.02.2005, then to 2.6.11 on 05.03.2005 - and only
> noticed with 2.6.11 first, now with 2.6.11-rc5-bk2 too.
>
> there is an interesting part in the logfiles:
>
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11_2.txt
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11-rc5-bk2.txt
>
> every last message before the "OOM" messages is something with pppd:
>
> Mar 10 13:45:55 sheep pppd[1567]: Starting link
> Mar 10 14:12:29 sheep kernel: oom-killer: gfp_mask=0x1d2
>
> Mar 8 00:59:58 sheep pppd[418]: Starting link
> Mar 8 01:27:33 sheep kernel: oom-killer: gfp_mask=0xd0
>
> Mar 9 07:33:49 sheep pppd[30937]: Starting link
> Mar 9 08:01:35 sheep kernel: oom-killer: gfp_mask=0x1d2
>
> and 30min later OOM kicks in. normally, pppd (pppoe) gives messages like this:
>
> Mar 10 14:23:38 sheep pppd[26365]: Starting link
> Mar 10 14:23:38 sheep pppd[26365]: Serial connection established.
> Mar 10 14:23:38 sheep pppd[26365]: Connect: ppp0 <--> /dev/pts/0
> Mar 10 14:23:38 sheep pppoe[26383]: PADS: Service-Name: ''
> Mar 10 14:23:38 sheep pppoe[26383]: PPP session is 6804
> Mar 10 14:23:39 sheep pppd[26365]: CHAP authentication succeeded
> Mar 10 14:23:40 sheep pppd[26365]: Local IP address changed to
> [...]
>
> is this strange? or not?
>
> i hope someone has a hint for me, because "going back to the stable
> kernel" would mean "being bound to 2.6.11-rc2-bk10" :(
>
> thank you for any hints,
> Christian.
>
> PS: Steven, i've cc'ed you because you have trouble with new 2.6.11
> kernels and pppd too. maybe unrelated, maybe not.
> --
> BOFH excuse #185:
>
> system consumed all the paper for paging
>

2005-03-11 10:59:10

by Coywolf Qi Hunt

[permalink] [raw]
Subject: Re: oom with 2.6.11

In file mm/oom_kill.c, uncomment line 24: /* #define DEBUG */.
And next time when oom happens again, we'll see the badness.

--coywolf

On Tue, 08 Mar 2005 16:21:21 +0100, Christian Kujau <[email protected]> wrote:
> hallo list,
>
> today my machine went out out memory and noticing it several hours after
> the first OOM message in the log, i wonder
> 1) why this happened at all and
> 2) why almost every service was killed despite the clever algorithms
> documented in mm/oom_kill.c.
>
> the first oom message went to the syslog at 01:27, i was away and no heavy
> tasks were scheduled:
>
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
>
> mysqld got killed by the oom killer, so i have to suspect mysql for being
> the reason for oom here, even that i know that mysqld is running all day
> long. several other tasks got killed, but "Free swap" stays at 0kB and the
> oom killer kills almost every other tasks, with no success in freeing ram.
>
> the log stops at 03:21, perhaps syslog-ng got killed.
> at around 07:31 i noticed the mess, did SYSRQ-E and now i was able to
> login again. i pressed SYSRQ-M/T/P too, they are all in the log. at this
> time loadavg was at 249 ;)
>
> i went to runlevel 2, then up again to 3 and all services are up and
> running again.
>
> some 2.6.11-rc3 BK snapshot was running pretty stable (no OOM) for ~30
> days before i switched to 2.6.11 (vanilla) a few days ago. i have to (not)
> reproduce the problem the next night, i wonder if it will happen again.
>
> do you vm-gurus have any idea to the points asked above?
>
> more infos about the box here: http://nerdbynature.de/bits/sheep/2.6.11/oom/
>
> thank you for your comments,
> Christian.
> --
> BOFH excuse #281:
>
> The co-locator cannot verify the frame-relay gateway to the ISDN server.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>


--
Coywolf Qi Hunt
Homepage http://sosdg.org/~coywolf/

2005-03-11 15:10:47

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

Mauricio Lin wrote:
> Hi Christian,
>
> I would like to know what are the kernel versions this problem happened.
>
> Did this problem start from 2.6.11-rc2-bk10?

i noticed it first at 2.6.11, then again with 2.6.11-rc5-bk2. suspecting
pppd to be the culprit to chew up all RAM after being terminated by my ISP
once a day - i just have to wait (must be around 2a.m.).

thanks,
Christian.
--
BOFH excuse #134:

because of network lag due to too many people playing deathmatch

2005-03-11 15:13:33

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Coywolf Qi Hunt wrote:
> In file mm/oom_kill.c, uncomment line 24: /* #define DEBUG */.
> And next time when oom happens again, we'll see the badness.

oh, good hint. will do this before the next reboot (in a few hours i guess)


thanks,
Christian.

- --
BOFH excuse #134:

because of network lag due to too many people playing deathmatch
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCMbTj+A7rjkF8z0wRAmdAAJ4sCIaUqZaKn7gGtkpN7Wb47acFiACgrOSe
DpT3tE1/4zfPIDueDwBMhDU=
=o6Lr
-----END PGP SIGNATURE-----

--
BOFH excuse #37:

heavy gravity fluctuation, move computer to floor rapidly

2005-03-12 18:06:50

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

hi again,

i had to wait for my pppoe session to be terminated by the remote peer
[1], and now it happened again with 2.6.11-rc5-bk2:

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11-rc5-bk2_2.txt
http://nerdbynature.de/bits/sheep/2.6.11/oom/lsmod_2.6.11-rc5-bk2
http://nerdbynature.de/bits/sheep/2.6.11/oom/config-2.6.11-rc5-bk2.gz

i wanted to see the application chewing up the most RAM so i was running
"ps aux --sort=-vsz,-rss | head -n1" every 5sec and wrote it's output to a
file:

http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11-rc5-bk2.log.gz

but even when almost all swap is used up, mysqld ist still no#1 with
131928 KB VSZ, as always. still suspecting something wrong with pppd, the
binary istself does *not* show up on the top-ten-memory-pigs.

real memory is almost always used up, but that's ok.
i stripped it down to show only the timestamp and the "Swap: .. used" numbers:

http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11-rc5-bk2_stripped.log.gz

as you can see, until "Sat Mar 12 01:05:51" 122MB swap was used, from then
on swap-usage goes up until "Sat Mar 12 01:49:45", when all swap is in
use. the system is unable to keep up with writing memory usage every 5
seconds, the next entry is from 12 "01:56:24" - 6 minutes after the first
OOM message (i'm sure someone could feed this to gnuplot and make nice
bars out of it - i can't)

where else could i look to actually *see* where the memory goes to?

somehow lost in kernel versions,
Christian.

[1] "remote peer" aka "ISP". is there a way to "simulate" a "LCP
terminated by peer" locally? otherwise i really have to wait 24h to
trigger the bug.
--
BOFH excuse #373:

Suspicious pointer corrupted virtual machine

2005-03-15 08:52:43

by Mauricio Lin

[permalink] [raw]
Subject: Re: oom with 2.6.11

Hi Christian,

On Fri, 11 Mar 2005 16:09:24 +0100, Christian Kujau <[email protected]> wrote:
> Mauricio Lin wrote:
> > Hi Christian,
> >
> > I would like to know what are the kernel versions this problem happened.
> >
> > Did this problem start from 2.6.11-rc2-bk10?
>
> i noticed it first at 2.6.11, then again with 2.6.11-rc5-bk2. suspecting
> pppd to be the culprit to chew up all RAM after being terminated by my ISP
> once a day - i just have to wait (must be around 2a.m.).

Have you tried with 2.6.10 in order to check this problem?

BR,

Mauricio Lin.

2005-03-15 14:12:54

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mauricio Lin wrote:
>>>
>>>Did this problem start from 2.6.11-rc2-bk10?
>>
>>i noticed it first at 2.6.11, then again with 2.6.11-rc5-bk2. suspecting
>>pppd to be the culprit to chew up all RAM after being terminated by my ISP
>>once a day - i just have to wait (must be around 2a.m.).
>
> Have you tried with 2.6.10 in order to check this problem?

i don't think i've run plain 2.6.10 at all, syslog does not say so. i've
run several versions of 2.6.9, then 2.6.10-rc1, 2.6.10-rc1-bk19,
2.6.10-rc2-bk9, 2.6.10-rc3-bk11, 2.6.11-rc2-bk10, 2.6.11-rc5-bk2.

as stated earlier, the problem kicked in when i switched to 2.6.11, i
switched back to 2.6.11-rc5-bk2 and it happened again. by looking at the
syslog i am (still) suspecting pppd to be the source of OOM, but it did
not show up as a memory pig in "ps". i am running 2.6.11.3 now and the
machine survived almost 2 passes of pppd's daily hangups. i'll keep an eye
on it.

thank you for your concern,
Christian.
- --
BOFH excuse #357:

I'd love to help you -- it's just that the Boss won't let me near the
computer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCNu1e+A7rjkF8z0wRAmSPAJ49erb21bO6inHPn/zoUH2gNHJk8QCguYDA
HyzXS9whrJnb9eXVz9IP8zc=
=zhEx
-----END PGP SIGNATURE-----

2005-03-17 01:27:47

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

hello again,

unfortunately i've hit OOM again, this time with "#define DEBUG" enabled
in mm/oom_kill.c:

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.3.txt

by "Mar 16 18:32" pppd died again and OOM kicked in 30min later.
(there are a *lot* messages of a shell script named "check-route.sh". it's
a little script which runs every minute or so to check if my default route
is still ok and if ping to the outside world are possible. definitely not
a memory hog, but noisy)

since tracking the "most memory consuming applications" did not reveal any
hints [1], i have monitored /proc/slabinfo and /proc/meminfo this time:

http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11.3.gz

as stated before, i was suspecting pppd to be the bad guy here, and yes: i
downgraded pppd to an earlier version and pppd (and the system) survived 2
terminations of my dial-up ISP. yesterday i've upgraded back again to
current pppd (debian/unstable) and the OOM problem returned. yes, i'll bug
the debian people now (hello!), but grepping for "ppp" in
daily_stats-2.6.11.3.gz gives no hits. so "pppd" does not get *any* points
from mm/oom_kill.c and thus no attempts are made to kill it (it is always
only kill'able with "-9"). furthermore, i thought /proc/slabinfo coud give
me some hints about *where* all the memory went in. scrolling down this
file to the bottom, where "SwapFree" shows "0 kB" i don't see any alarming
numbers in the "slabinfo" right above "meminfo".

could someone give me a hint, please?

thanks,
Christian.

[1] http://lkml.org/lkml/2005/3/12/88

more info for this recent OOM issue:
http://nerdbynature.de/bits/sheep/2.6.11/oom/dmesg.2.6.11.3
http://nerdbynature.de/bits/sheep/2.6.11/oom/lsmod_2.6.11.3
http://nerdbynature.de/bits/sheep/2.6.11/oom/config-2.6.11.3.gz
--
BOFH excuse #62:

need to wrap system in aluminum foil to fix problem

2005-03-17 01:51:31

by Andrew Morton

[permalink] [raw]
Subject: Re: oom with 2.6.11

Christian Kujau <[email protected]> wrote:
>
> unfortunately i've hit OOM again, this time with "#define DEBUG" enabled
> in mm/oom_kill.c:
>
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.3.txt
>
> by "Mar 16 18:32" pppd died again and OOM kicked in 30min later.
> (there are a *lot* messages of a shell script named "check-route.sh". it's
> a little script which runs every minute or so to check if my default route
> is still ok and if ping to the outside world are possible. definitely not
> a memory hog, but noisy)
>
> since tracking the "most memory consuming applications" did not reveal any
> hints [1], i have monitored /proc/slabinfo and /proc/meminfo this time:
>
> http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11.3.gz
>
> as stated before, i was suspecting pppd to be the bad guy here, and yes: i
> downgraded pppd to an earlier version and pppd (and the system) survived 2
> terminations of my dial-up ISP. yesterday i've upgraded back again to
> current pppd (debian/unstable) and the OOM problem returned. yes, i'll bug
> the debian people now (hello!), but grepping for "ppp" in
> daily_stats-2.6.11.3.gz gives no hits. so "pppd" does not get *any* points
> from mm/oom_kill.c and thus no attempts are made to kill it (it is always
> only kill'able with "-9").

The oom-killer tries to be nicer to processes which are running as root.

> furthermore, i thought /proc/slabinfo coud give
> me some hints about *where* all the memory went in. scrolling down this
> file to the bottom, where "SwapFree" shows "0 kB" i don't see any alarming
> numbers in the "slabinfo" right above "meminfo".

MemTotal: 256372 kB
MemFree: 3280 kB
Buffers: 608 kB
Cached: 3256 kB
SwapCached: 664 kB
Active: 105020 kB
Inactive: 20364 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 256372 kB
LowFree: 3280 kB
SwapTotal: 784468 kB
SwapFree: 0 kB
Dirty: 12 kB
Writeback: 0 kB
Mapped: 130332 kB
Slab: 61424 kB
CommitLimit: 912652 kB
Committed_AS: 1323548 kB
PageTables: 51668 kB
VmallocTotal: 778184 kB
VmallocUsed: 3464 kB
VmallocChunk: 774492 kB

Some application went berzerk, used up all the swap and then oomed the box.

You could perhaps run `top -d1' then hit M so the output is sorted by
bloatiness, then try to catch the culprit.

But it would be better to have some app which prints the N most
memory-hungry processes every second and simply scrolls that up the screen.
I'm not aware of such a thing, but it could be cooked up via
/proc/N/cmdline and /proc/N/statm.

2005-03-17 02:00:22

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

Andrew Morton wrote:
>
> Some application went berzerk, used up all the swap and then oomed the box.
>
> You could perhaps run `top -d1' then hit M so the output is sorted by
> bloatiness, then try to catch the culprit.

i've already done that. as OOM happens when i am not around, i did that
with "ps":

http://lkml.org/lkml/2005/3/12/88
http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11-rc5-bk2.log.gz

> But it would be better to have some app which prints the N most
> memory-hungry processes every second and simply scrolls that up the screen.
> I'm not aware of such a thing, but it could be cooked up via
> /proc/N/cmdline and /proc/N/statm.

i hope the link above does reveal this information.

i just wrote a bug report for the (debian), ppp package, but to know
*where* the memory goes to would really help, i think.

thank you,
Christian.
--
BOFH excuse #72:

Satan did it

2005-03-17 21:25:15

by Coywolf Qi Hunt

[permalink] [raw]
Subject: Re: oom with 2.6.11

On Thu, 17 Mar 2005 02:27:29 +0100, Christian Kujau <[email protected]> wrote:
> hello again,
>
> unfortunately i've hit OOM again, this time with "#define DEBUG" enabled
> in mm/oom_kill.c:
>
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.3.txt
>
> by "Mar 16 18:32" pppd died again and OOM kicked in 30min later.
> (there are a *lot* messages of a shell script named "check-route.sh". it's
> a little script which runs every minute or so to check if my default route
> is still ok and if ping to the outside world are possible. definitely not
> a memory hog, but noisy)


I do "grep check-route.sh oom_2.6.11.3.txt | wc" and it shows 4365
lines, which means
there're 4365 that script processes running, from pid 4260 to12747,
mostly with pretty low points, 123. Based on this points, suppose
each script consumes 100k, that'll be 100k*4k=400M roughly. And your
box's is merely 256M MemTotal.

The currently oom algorithm fails to find out such kinds of memory hog.
And the kernel kills other innocent processes because the its points
is much lower than most others.

Check this script and disable it; see what will happen.

--
Coywolf Qi Hunt
http://sosdg.org/~coywolf/

2005-03-18 01:59:57

by Christian Kujau

[permalink] [raw]
Subject: Re: oom with 2.6.11

Coywolf Qi Hunt wrote:
> I do "grep check-route.sh oom_2.6.11.3.txt | wc" and it shows 4365

duh, good catch! really!

> lines, which means there're 4365 that script processes running, from
> pid 4260 to12747, mostly with pretty low points, 123.
> Based on this points, suppose each script consumes 100k, that'll be
> 100k*4k=400M roughly. And your box's is merely 256M MemTotal.

yes, i just checked, the script is looping and crond is starting a new
one, and another....and the oom-killer does not catch it, because it's too
small and of course don't know where it is coming from (crond).

> Check this script and disable it; see what will happen.

yes, will do that. on a (not so unimportant) side-note: i was told the
whole thing should be fixed with 2.6.11.4:

[PATCH] CAN-2005-0384: Remote Linux DoS on ppp servers


after all it seems to be PEBKAC and bad luck...what a week.

thank you for your help,
Christian.
--
BOFH excuse #416:

We're out of slots on the server

2005-03-20 14:35:48

by Christian Kujau

[permalink] [raw]
Subject: [SOLVED] Re: oom with 2.6.11

for the record: the "oom bug" turned out to be user generated. a *lot* of
small scripts were started, triggering oom again and again, user error.

the source of the problem is still pppd and the discussion continues as a
debian bugreport:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=299875

thank you for all your help,
Christian.
--
BOFH excuse #139:

UBNC (user brain not connected)