2006-11-11 16:40:27

by Christian Kujau

Subject: OOM in 2.6.19-rc*

Hello,

a few days ago I upgraded my desktop machine (x86_64) to ubuntu/edgy
thus completely changing the userland. Since I'm using kernel.org
kernels I upgraded to a current kernel as well (2.6.19-rc4-git from Nov
4 and 2.6.19-rc4-mm2). Now, while working under X11, probably reading
email, all of a sudden the machine was not responsive any more and the
disk was spinning like wild. The desktop applet showed all swap being
used up, then the display froze too, and ~5 min later the machine came
back with the gnome-login screen: it had not rebooted but had run OOM and
several apps had been killed.

OK, must be some application leaking memory, I thought, that's what
happens with a new userland version. Looking at the syslog, "nautilus"
(gnome filemanager) invoked the oom killer. OK, but the scenario
repeated the next day, early in the morning when I was not even on the
box, saying it was nautilus again.
In the last few days other applications seem to invoke the OOM killer as
well, and I wonder if each one of them is really to blame for leaking
memory or if something else is responsible for the killings. Here's the
log output, each line listing the first application triggering the OOM
killer in that log file:

# for i in /var/log/messages*; do (zgrep "invoked" "$i" | head -1 ); done
Nov 11 08:04:16 prinz64 kernel: [104237.902269] firefox-bin invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Nov 10 07:59:34 prinz64 kernel: [64627.382818] Xorg invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Nov 9 07:59:22 prinz64 kernel: [25047.487534] rpc.idmapd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=-17
Nov 8 17:33:59 prinz64 kernel: [ 919.954547] beep-media-play invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Nov 7 18:55:23 prinz64 kernel: [ 842.590646] firefox-bin invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Nov 5 07:55:34 prinz64 kernel: [18128.545690] nautilus invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Nov 4 17:31:23 prinz64 kernel: [ 688.904652] nautilus invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
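
The listing above shows only the first invoker per log file. A minimal sketch for tallying every invocation by process name follows. The function name `count_oom_invokers` is hypothetical, and the pattern assumes the log line format shown above ("<name> invoked oom-killer:"); other kernel versions may word the message differently.

```shell
# Tally which processes invoked the OOM killer.
# Reads syslog lines on stdin and prints "<count> <name>", most frequent first.
# Assumption: lines look like "... kernel: [ 919.954547] NAME invoked oom-killer: ..."
count_oom_invokers() {
    # Keep only the process name between the timestamp bracket and "invoked".
    sed -n 's/.*\] \(.*\) invoked oom-killer:.*/\1/p' |
        sort | uniq -c |
        # Highest count first, ties ordered by name.
        sort -k1,1nr -k2 |
        # Drop the leading padding that uniq -c adds.
        sed 's/^ *//'
}
```

It could be fed from all rotated logs with something like `zgrep -h "invoked oom-killer" /var/log/messages* | count_oom_invokers`.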

The kernels running when these were happening:
Nov 4 - 2.6.19-rc2
Nov 5 - 2.6.19-rc2
Nov 7 - 2.6.19-rc4-mm2
Nov 8 - 2.6.19-rc4
Nov 9 - 2.6.19-rc4
Nov 10 - 2.6.19-rc4
Nov 11 - 2.6.19-rc4

Because killing these applications does not seem to free up memory,
plenty of other applications got killed shortly after this. Full logs
and .config can be found here: http://nerdbynature.de/bits/2.6.19-rc4/

I do notice anacron running just before the killings - but: even *if*
anacron runs a mem-leaking program, shouldn't the OOM killer kill just
that app rather than the (probably) innocent ones in the first place?

Thanks for your thoughts,
Christian.
--
BOFH excuse #194:

We only support a 1200 bps connection.


2006-11-11 17:23:10

by Benoit Boissinot

Subject: Re: OOM in 2.6.19-rc*

On 11/11/06, Christian Kujau <[email protected]> wrote:
> Hello,
>
> a few days ago I upgraded my desktop machine (x86_64) to ubuntu/edgy
> thus completely changing the userland. Since I'm using kernel.org
> kernels I upgraded to a current kernel as well (2.6.19-rc4-git from Nov
> 4 and 2.6.19-rc4-mm2). Now, while working under X11, probably reading
> email, all of a sudden the machine was not responsive any more and the
> disk was spinning like wild. The desktop applet showed all swap being
> used up then the display froze too and ~5 min later the machine came
> back with the gnome-login screen: it had not rebooted but ran OOM and
> several apps got killed.
>
Just a thought, do you have swap activated? (There is a bug in edgy
where the swap isn't mounted.)
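
One way to check this from a shell is to look at the kernel's swap table. The following is only a sketch: the function name `swap_is_active` and its optional file argument are hypothetical (the argument exists so the check can be tried against a copy of the table), and it assumes the usual `/proc/swaps` layout of one header row followed by one line per active swap area.

```shell
# Succeed (exit status 0) if at least one swap area is active.
# $1: path to a swaps table, defaults to /proc/swaps.
swap_is_active() {
    swaps_file="${1:-/proc/swaps}"
    # Skip the header row and count the remaining non-empty lines.
    [ "$(tail -n +2 "$swaps_file" 2>/dev/null | grep -c .)" -gt 0 ]
}
```

For example, `swap_is_active && echo "swap is on" || echo "no active swap"`.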

regards,

Benoit

2006-11-11 18:19:35

by Adrian Bunk

Subject: Re: OOM in 2.6.19-rc*

On Sat, Nov 11, 2006 at 04:40:17PM +0000, Christian Kujau wrote:
> Hello,
>
> a few days ago I upgraded my desktop machine (x86_64) to ubuntu/edgy
> thus completely changing the userland. Since I'm using kernel.org
> kernels I upgraded to a current kernel as well (2.6.19-rc4-git from Nov
> 4 and 2.6.19-rc4-mm2). Now, while working under X11, probably reading
> email, all of a sudden the machine was not responsive any more and the
> disk was spinning like wild. The desktop applet showed all swap being
> used up then the display froze too and ~5 min later the machine came
> back with the gnome-login screen: it had not rebooted but ran OOM and
> several apps got killed.
>...

Can you test whether an older kernel (preferably the one that worked
before) shows the same problem?

This way you might know whether it's a kernel problem or a distribution
problem.

> Thanks for your thoughts,
> Christian.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2006-11-11 18:31:55

by Christian Kujau

Subject: Re: OOM in 2.6.19-rc*

On Sat, 11 Nov 2006, Benoit Boissinot wrote:
> On 11/11/06, Christian Kujau <[email protected]> wrote:
> Just a thought, do you have a swap activated ? (there is a bug in edgy
> where the swap isn't mounted)

ah, forgot to mention this: yes, swap was activated (~300 MB swapfile,
box has 1GB RAM) but I disabled it after the 2nd incident because I
thought the machine would recover faster from OOM when no swap to fill
up was available...didn't help much though :(

Thanks,
Christian.
--
BOFH excuse #142:

new guy cross-connected phone lines with ac power bus.

2006-11-11 18:38:12

by Christian Kujau

Subject: Re: OOM in 2.6.19-rc*

On Sat, 11 Nov 2006, Adrian Bunk wrote:
> Can you test whether an older kernel (preferably the one that worked
> before) shows the same problem?

I could try 2.6.17...but currently I don't know how to reproduce the OOM
condition - so I'd have to wait 24h until *something* happens and the
OOM killer kicks in.

> This way you might know whether it's a kernel problem or a distribution
> problem.

I think I'm more interested as to why the OOM killer seems to kill
innocent apps at random. I can imagine that it's not easy for the kernel
to tell which userland-application is using up too much memory. Hm,
egrep -r "OOM|ut of memory" Documentation/ does not reveal much :(

Thanks,
Christian.
--
BOFH excuse #362:

Plasma conduit breach

2006-11-11 18:53:46

by Adrian Bunk

Subject: Re: OOM in 2.6.19-rc*

On Sat, Nov 11, 2006 at 06:38:05PM +0000, Christian Kujau wrote:
> On Sat, 11 Nov 2006, Adrian Bunk wrote:
> >Can you test whether an older kernel (preferably the one that worked
> >before) shows the same problem?
>
> I could try 2.6.17...but currently I don't know how to reproduce the OOM
> condition - so I'd have to wait 24h until *something* happens and the
> OOM killer kicks in.

If you want to know what caused your problem, this is the logical first
step.

> >This way you might know whether it's a kernel problem or a distribution
> >problem.
>
> I think I'm more interested as to why the OOM killer seems to kill
> innocent apps at random. I can imagine that it's not easy for the kernel
> to tell which userland-application is using up too much memory. Hm,
> egrep -r "OOM|ut of memory" Documentation/ does not reveal much :(

mm/oom_kill.c is well documented.

> Thanks,
> Christian.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2006-11-11 19:13:08

by Christian Kujau

Subject: Re: OOM in 2.6.19-rc*

On Sat, 11 Nov 2006, Adrian Bunk wrote:
> mm/oom_kill.c is well documented.

Thanks, I'll take a look.

Christian.
--
BOFH excuse #353:

Second-system effect.

2006-11-11 20:38:07

by Tim Schmielau

Subject: Re: OOM in 2.6.19-rc*

On Sat, 11 Nov 2006, Christian Kujau wrote:

> I think I'm more interested as to why the OOM killer seems to kill innocent
> apps at random. I can imagine that it's not easy for the kernel to tell which
> userland-application is using up too much memory. Hm, egrep -r "OOM|ut of
> memory" Documentation/ does not reveal much :(

A look at /proc/*/oom_score might shed some light on the "at random" part.
I.e., doing
  for job in /proc/[0-9]* ; do \
    echo -e "$(cat $job/oom_score) \t $job \t $(tr '\0' ' ' < $job/cmdline | head -c50)"; \
  done | sort -n
the last process listed is considered the biggest memory hog of the
moment (Of course, this still does not tell _why_).
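
The same idea can be wrapped in a function that tolerates processes exiting while the loop runs. This is a sketch, not a tested tool: the name `rank_oom_scores` and its optional directory argument are hypothetical (the argument defaults to /proc and only exists so the function can be pointed at a test directory).

```shell
# Print "score<TAB>pid<TAB>command" for every process, highest oom_score last.
# $1: proc root to scan, defaults to /proc.
rank_oom_scores() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # A process may exit between the glob and the read; skip it then.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        # cmdline is NUL-separated; turn NULs into spaces, keep 50 characters.
        cmd=$(tr '\0' ' ' 2>/dev/null < "$dir/cmdline" | cut -c1-50)
        printf '%s\t%s\t%s\n' "$score" "${dir##*/}" "$cmd"
    done | sort -n
}
```

Running `rank_oom_scores | tail -n 5` would then show the five processes the kernel currently rates as the most likely OOM victims.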

Tim

2006-11-12 09:15:19

by Arjan van de Ven

Subject: Re: OOM in 2.6.19-rc*

On Sat, 2006-11-11 at 16:40 +0000, Christian Kujau wrote:
> Hello,
>
> a few days ago I upgraded my desktop machine (x86_64) to ubuntu/edgy
> thus completely changing the userland. Since I'm using kernel.org
> kernels I upgraded to a current kernel as well (2.6.19-rc4-git from Nov
> 4 and 2.6.19-rc4-mm2).

Which modules/drivers do you use? Maybe there's a less commonly used one
in there that we could look at.
(The assumption is that all commonly used ones would have shown up
en-masse on lkml if there was a big leak in them; rarer ones less so)

2006-11-13 00:56:34

by Christian Kujau

Subject: Re: OOM in 2.6.19-rc*

Oh dear, Murphy strikes again...or was it Heisenberg? Since I posted to
lkml the daily OOM killings have gone away. I'm running 2.6.19-rc5-mm1
right now and there has been no OOM situation today...phew.

On Sun, 12 Nov 2006, Arjan van de Ven wrote:
> which modules/drivers do you use? Maybe there's a less commonly used one
> in there that we could look at.

Thanks for your reply (all your replies!), FWIW:

# lsmod
Module                 Size  Used by
dm_crypt              12304  0
dm_mod                55280  1 dm_crypt
powernow_k8           10584  0
freq_table             4168  1 powernow_k8
w83627hf              28944  0
hwmon_vid              3648  1 w83627hf
eeprom                 6992  0
i2c_dev                7368  0
i2c_isa                5184  1 w83627hf
ide_cd                39520  0
cdrom                 37160  1 ide_cd
ide_disk              14272  0
ata_generic            6468  0
libata               106920  1 ata_generic
qla2xxx              154668  0
firmware_class         9216  1 qla2xxx
snd_intel8x0          32872  2
snd_ac97_codec       108440  1 snd_intel8x0
snd_ac97_bus           2816  1 snd_ac97_codec
ohci1394              33032  0
ieee1394              93168  1 ohci1394
snd_pcm_oss           41440  0
snd_mixer_oss         16512  1 snd_pcm_oss
snd_pcm               74828  3 snd_intel8x0,snd_ac97_codec,snd_pcm_oss
snd_timer             22536  1 snd_pcm
k8temp                 5440  0
scsi_transport_fc     39492  1 qla2xxx
i2c_nforce2            5696  0
i2c_core              20056  5 w83627hf,eeprom,i2c_dev,i2c_isa,i2c_nforce2
amd74xx               15344  0 [permanent]
ide_core             130300  3 ide_cd,ide_disk,amd74xx
snd                   56680  10 snd_intel8x0,snd_ac97_codec,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer
soundcore              7648  1 snd
hwmon                  3168  2 w83627hf,k8temp
snd_page_alloc         8464  2 snd_intel8x0,snd_pcm

# uname -a
Linux prinz64 2.6.19-rc5-mm1 #4 PREEMPT Sat Nov 11 16:02:25 GMT 2006 x86_64 GNU/Linux


Christian.
--
BOFH excuse #21:

POSIX compliance problem