It seems linux-2.4 still freezes on out-of-memory situations:
I was using 2.4.2-ac3 SMP and had a fairly large background job that takes
hundreds of megabytes of memory, much more than I have:
Mem: 255296 81836 173460 0 10324 30608
Swap: 99992 0 99992
Usually I swapon ./swap some 512MB swapfile, but today I forgot it. When the
machine started to get sluggish I sent the process a -STOP signal.
Swap: 99992 99992 0
O.k., I had about 12MB of main memory free (in the -/+ buffers line of
free) and the machine was sluggish but workable for about five minutes. At
the instant I did a swapon ./swap the machine froze hard (no sysrq, no
ping etc...).
I thought these complete freezes on OOM-situations had been fixed in
2.4.x? Do I have to watch out for Andrea's fix-2.4-oom patches?
;)
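A minimal sketch of a guard against the forgotten swapon described above, assuming the "Name: value kB" lines of /proc/meminfo. Nothing like this is used in the thread: the function names and the 500MB estimate for the job are hypothetical, and the sample text just restates the free figures quoted in this message.

```python
def parse_meminfo(text):
    """Return {field: kilobytes} for the 'Name: 12345 kB' lines of /proc/meminfo."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip header or malformed lines whose first value is not a plain integer.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_is_sufficient(meminfo_text, needed_kb):
    """True if free RAM plus free swap covers needed_kb (a deliberately crude check)."""
    info = parse_meminfo(meminfo_text)
    available_kb = info.get("MemFree", 0) + info.get("SwapFree", 0)
    return available_kb >= needed_kb


# Figures from the report: 173460 kB of RAM free, 99992 kB of swap free.
sample = (
    "MemTotal: 255296 kB\n"
    "MemFree: 173460 kB\n"
    "SwapTotal: 99992 kB\n"
    "SwapFree: 99992 kB\n"
)

# Hypothetical estimate: the background job needs roughly 500MB.
if not swap_is_sufficient(sample, 500 * 1024):
    print("not enough RAM + swap; run swapon before starting the job")
```

On a live system the text would come from reading /proc/meminfo instead of the sample string; the check ignores buffers and cache, so it errs on the cautious side.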
--
-----==- |
----==-- _ |
---==---(_)__ __ ____ __ Marc Lehmann +--
--==---/ / _ \/ // /\ \/ / [email protected] |e|
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+
The choice of a GNU generation |
|
On Sun, 25 Feb 2001, Marc Lehmann wrote:
> It seems linux-2.4 still freezes on out-of-memory situations:
<snip>
> Usually I swapon ./swap some 512MB swapfile, but today I forgot it. When the
> machine started to get sluggish I sent the process a -STOP signal.
Signal delivery during oomest does not work (last time I tested).
Andrea fixed this once.. long time ~problem.
-Mike
On Sun, Feb 25, 2001 at 05:58:32PM +0100, Mike Galbraith wrote:
> > Usually I swapon ./swap some 512MB swapfile, but today I forgot it. When the
> > machine started to get sluggish I sent the process a -STOP signal.
>
> Signal delivery during oomest does not work (last time I tested).
> Andrea fixed this once.. long time ~problem.
Well, the signal delivery seemed to have worked fine - the machine
was quite usable (it swapped a lot, but the system was never unusable
for longer than a second or so). The problem started when I did the
swapon. Well, it didn't start, the system just froze.
On Sun, Feb 25, 2001 at 05:58:32PM +0100, Mike Galbraith wrote:
> Signal delivery during oomest does not work (last time I tested).
> Andrea fixed this once.. long time ~problem.
Hmm, here is something that is new: just now, the machine gets VERY very
sluggish and swaps:
             total       used       free     shared    buffers     cached
Mem:        255296     253708       1588          0      29808     183020
-/+ buffers/cache:      40880     214416
Swap:        99992      99992          0
Now, there is plenty of free memory (200megs!) but no swap space, and the
kernel keeps swapping. The only interesting processes here are:
  PID TTY    STAT TIME MAJFL  TRS   DRS        RSS   %MEM COMMAND
  112 ?      S    0:00   742 1366 38921       3460    1.3 /opt/mysql//libexec/mysqld --basedir=/opt/mysql/ --datadir=/var/mysql --user=root --pid-
  205 ?      S    2:28 12335 1444 27167 4294966180 6728.9 /usr/bin/X11/X :0 -audit 1 -auth /etc/cfg/Xauthority -a 2 -once -t 5 vt02 -defer
  421 pts/13 TN   1:00   804  707 31552      17444    6.8 /usr/bin/perl ./summarize
  376 pts/10 R    7:07   269  129 22614       1852    0.7 rsync -av . doom cerebro-root/. --delete
When I SIGSTOP the summarize script (which uses mysql very intensively)
the system starts to work again but the memory situation does not
improve. The RSS size of X puzzles me a bit, but this was always the case
under 2.4.2 and 2.4.2ac3 (and maybe before) and didn't cause a problem
before.
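The numbers in the free output above can be checked with a little arithmetic, sketched here in Python. The first part reproduces the "-/+ buffers/cache" line from the Mem: line. The second part is only a guess at the odd RSS of X: it assumes the value is a small negative count printed as an unsigned 32-bit number, which the thread does not confirm.

```python
# Values (in kB) copied from the free output in this message.
total, used, free, buffers, cached = 255296, 253708, 1588, 29808, 183020

# The "-/+ buffers/cache" line counts buffers and page cache as reclaimable.
used_by_programs = used - buffers - cached        # 40880
available_to_programs = free + buffers + cached   # 214416

# RSS that ps reports for X, reinterpreted as a signed 32-bit value
# (assumption: the counter went slightly negative and wrapped around).
rss_reported = 4294966180
rss_signed = rss_reported - 2**32 if rss_reported >= 2**31 else rss_reported

print(used_by_programs, available_to_programs, rss_signed)
```

Both derived figures match the second line of the free output, which is where the "200megs" of nominally free memory comes from.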
Another bug I found is that initializing md on the kernel commandline in
the wrong order (first md1 then md0) keeps the kernel from mounting md0
as root-device. Another problem is that, when I "startraid /dev/md1" (a
two-partition, striped raid without persistent superblock) I get strange
errors in /var/log/kernel (if anybody asks I'll provide them) but it works
fine when I use md=x on the kernel commandline. It's not a configuration
problem since I got the same strange problems with the mdstart I used
successfully under 2.1 and 2.2.
Another nitpick is kernel-pcmcia: For some unexplainable reason, the
kernel SWITCHES OFF POWER to the pcmcia slots BEFORE notifying apmd, which
then tries to save important data and locks (not the machine, just the
script) since the network is suddenly dead although interface etc.. all
still exist. Under the pcmcia-cs package one could work around this bug by
specifying do_apm=0 for the pcmcia_core module, which has no effect under
2.4.
So I keep asking myself: does anybody actually use 2.4 on production
machines? ;-> (Historically, it seems that my machines tend to freeze
easily because of sudden OOM and/or reiserfs ;)
Oh, and one last thing I forgot: loop devices. Since 2.4.1 (the first
version I used) through 2.4.2 and 2.4.2ac3 I only get:
cerebro:~# strace -f -o x losetup -e rc6 /dev/loop0 /dev/hdd
Memory Fault
And then no access to the loop device works anymore (clearly this is after
the 2.4.0.something crypto-patch applied, so this is probably not a 2.4.2
issue anyway since there is no 2.4.2 crypto patch).
Happy Hacking ;)
On Sun, 25 Feb 2001, Marc Lehmann wrote:
> On Sun, Feb 25, 2001 at 05:58:32PM +0100, Mike Galbraith wrote:
> > > Usually I swapon ./swap some 512MB swapfile, but today I forgot it. When the
> > > machine started to get sluggish I sent the process a -STOP signal.
> >
> > Signal delivery during oomest does not work (last time I tested).
> > Andrea fixed this once.. long time ~problem.
>
> Well, the signal delivery seemed to have worked fine - the machine
> was quite usable (it swapped a lot, but the system was never unusable
> for longer than a second or so). The problem started when I did the
> swapon. Well, it didn't start, the system just froze.
Ok.. I guess that got fixed. Used to be it was all over as soon
as the last of swap was consumed.
-Mike
On Sun, 25 Feb 2001, Marc Lehmann wrote:
> Oh, and one last thing I forgot: loop devices. Since 2.4.1 (the first
> version I used) through 2.4.2 and 2.4.2ac3 I only get:
>
> cerebro:~# strace -f -o x losetup -e rc6 /dev/loop0 /dev/hdd
> Memory Fault
>
> And then no access to the loop device works anymore (clearly this is after
> the 2.4.0.something crypto-patch applied, so this is probably not a 2.4.2
> issue anyway since there is no 2.4.2 crypto patch).
Hmm.. I remember having this problem and it was a problem with strace.
I fiddled with it and got it to work, but damned if I remember what
I did (had to be something trivial since I know jack spit about it:).
Anyway, it works fine here with virgin 2.4.2, so it seems unlikely it's
a kernel problem.
259 execve("/sbin/losetup", ["losetup", "/dev/loop0", "/dev/hda5"], [/* 47 vars */]) = 0
259 brk(0) = 134525772
259 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
259 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
259 open("/etc/ld.so.cache", O_RDONLY) = 4
259 fstat64(4, {st_mode=S_IFREG|0644, st_size=78316, ...}) = 0
259 old_mmap(NULL, 78316, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40019000
259 close(4) = 0
259 open("/lib/libc.so.6", O_RDONLY) = 4
259 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200\324"..., 1024) = 1024
259 fstat64(4, {st_mode=S_IFREG|0755, st_size=1401805, ...}) = 0
259 old_mmap(NULL, 1134564, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x4002d000
259 mprotect(0x40138000, 40932, PROT_NONE) = 0
259 old_mmap(0x40138000, 28672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x10a000) = 0x40138000
259 old_mmap(0x4013f000, 12260, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4013f000
259 close(4) = 0
259 munmap(0x40019000, 78316) = 0
259 getpid() = 259
259 brk(0) = 134525772
259 brk(0x804b374) = 134525812
259 brk(0x804c000) = 134529024
259 open("/dev/hda5", O_RDWR) = 4
259 open("/dev/loop0", O_RDWR) = -1 ENOSYS (Function not implemented)
259 open("/dev/loop0", O_RDWR) = -1 ENOSYS (Function not implemented)
259 open("/dev/loop0", O_RDWR) = 5
259 mlockall(0x3, 0xbffff9d3) = 0
259 ioctl(5, LOOP_SET_FD, 0x4) = 0
259 ioctl(5, LOOP_SET_STATUS, 0xbffff700) = 0
259 close(5) = 0
259 close(4) = 0
259 _exit(0) = ?
On Mon, Feb 26, 2001 at 08:11:55AM +0100, Mike Galbraith wrote:
> Hmm.. I remember having this problem and it was a problem with strace.
Well, I obviously strace'd it to find out why I get a memory fault without
one (I would be happy if it worked without strace ;->)
> Anyway, it works fine here with virgin 2.4.2, so it seems unlikely it's
> a kernel problem.
> 259 execve("/sbin/losetup", ["losetup", "/dev/loop0", "/dev/hda5"], [/* 47 vars */]) = 0
The -e switch is causing the memory fault and subsequent breakage:
743 open("/dev/hdd", O_RDWR) = 4
743 open("/dev/loop0", O_RDWR) = 5
743 mlockall(0x3, 0x804c272) = 0
743 ioctl(5, LOOP_SET_FD, 0x4) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_FD, 0x4) = 0
743 ioctl(5, LOOP_SET_STATUS, 0xbffff5d8) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_STATUS, 0xbffff5d8) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_STATUS, 0xbffff5d8) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_STATUS, 0xbffff5d8) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_STATUS <unfinished ...>
743 +++ killed by SIGSEGV +++
(which is a strange strace anyway...)
However, I just need to wait until there is a new crypto patch (and, if
not, I'll eventually have to hack it myself to get my data. After all it's
source... ...)
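The failing calls in the trace above can be tallied mechanically, which makes the pattern (one failed LOOP_SET_FD, then repeated failed LOOP_SET_STATUS before the SIGSEGV) easier to see. This is a minimal sketch in Python, assuming the "PID call(args) = -1 ERRNO (...)" line format shown in this thread; the function name and the regular expression are illustrative, not part of strace.

```python
import re
from collections import Counter

# Matches strace lines of the form "PID call(args) = -1 ERRNO (...)".
FAILED = re.compile(
    r"^\s*\d+\s+(?P<call>\w+)\((?P<args>.*)\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)"
)


def failed_calls(trace_text):
    """Count failing syscalls, keyed by (call, ioctl request or '', errno)."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = FAILED.match(line)
        if not match:
            continue  # successful, unfinished, or signal lines
        args = [a.strip() for a in match.group("args").split(",")]
        # For ioctl the second argument names the request, e.g. LOOP_SET_FD.
        is_ioctl = match.group("call") == "ioctl" and len(args) > 1
        request = args[1] if is_ioctl else ""
        counts[(match.group("call"), request, match.group("errno"))] += 1
    return counts


# The trace quoted in this message.
TRACE = """\
743 open("/dev/hdd", O_RDWR) = 4
743 open("/dev/loop0", O_RDWR) = 5
743 mlockall(0x3, 0x804c272) = 0
743 ioctl(5, LOOP_SET_FD, 0x4) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_FD, 0x4) = 0
743 ioctl(5, LOOP_SET_STATUS, 0xbffff5d8) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_STATUS, 0xbffff5d8) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_STATUS, 0xbffff5d8) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_STATUS, 0xbffff5d8) = -1 ENOSYS (Function not implemented)
743 ioctl(5, LOOP_SET_STATUS <unfinished ...>
743 +++ killed by SIGSEGV +++
"""

for key, n in failed_calls(TRACE).most_common():
    print(n, *key)
```

The unfinished LOOP_SET_STATUS line is deliberately not counted, since strace never printed a return value for it.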
On Mon, 26 Feb 2001, Marc Lehmann wrote:
> On Mon, Feb 26, 2001 at 08:11:55AM +0100, Mike Galbraith wrote:
>
> > Anyway, it works fine here with virgin 2.4.2, so it seems unlikely it's
> > a kernel problem.
>
> > 259 execve("/sbin/losetup", ["losetup", "/dev/loop0", "/dev/hda5"], [/* 47 vars */]) = 0
>
> The -e switch is causing the memory fault and subsequent breakage:
No problem here using -e xor. (have no real crypto to try)
> However, I just need to wait until there is a new crypto patch (and, if
> not, I'll eventually have to hack it myself to get my data. After all it's
> source... ...)
Probably.
-Mike