2003-11-17 07:18:55

by Keith Whyte

Subject: 2.4.18 fork & defunct child.

I'm at a loss to get myself out of this one, folks; I really have tried.
In desperation I am posting to linux-kernel in the hope that one of you good
folks has seen this behaviour before.

I have a kernel 2.4.18 install, based on a Slackware 8.1 system.
This system was installed almost a year ago, and within two weeks of being up
and running I began to have problems compiling.
make would fail with the likes of:
make[3]: *** wait: No child processes. Stop.
make[3]: *** Waiting for unfinished jobs....
make[3]: *** wait: No child processes. Stop.
make[2]: *** [first_rule] Error 2

I also discovered that programs like grep, echo and cut would often fork
and hang, and this was what was breaking the makes.

In this case (a kernel compile), the following is from ps axf:

17785 pts/0 T 0:00 touch /usr/src/linux-2.4.18/include/linux/ip.h
17786 pts/0 Z 0:00 \_ [touch <defunct>]
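
For readers unfamiliar with the ps states above: Z (<defunct>) marks a child that has already exited but has not yet been reaped by a wait call in its parent, and T marks a stopped parent. A minimal sketch of how a zombie appears and disappears, assuming a Linux system with /proc mounted (the helper names are made up for illustration):

```python
import os
import time


def proc_state(pid: int) -> str:
    """Return the one-letter state field from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo(timeout: float = 5.0):
    """Fork a child that exits at once and observe it before reaping it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers

    # Parent: until waitpid() is called, the exited child stays in the
    # process table as a zombie, which ps reports as Z / <defunct>.
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    reaped_pid, status = os.waitpid(pid, 0)  # reaping removes the zombie
    return state, reaped_pid == pid, os.WEXITSTATUS(status)
```

A parent that is stopped, or stuck before it reaches its wait call, leaves the child in the Z state indefinitely, which is the picture in the ps output.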

I tried reinstalling everything in /lib, among other things, but only a clean
reinstallation would fix it, and even then the problem kept coming back after
a few days.

To cut a long story short, after many clean reinstallations, hardware changes,
and me complaining to the ISP about what I thought was dodgy hardware, it
finally seemed to be working reliably. Now, after some 6 months, the problem
has returned.
(This machine is located at a remote ISP and I have never seen it. That makes
it difficult to try a new kernel, for example: if it doesn't come up with its
NICs and all, the ISP will charge heavily to intervene and fix it.)

This machine is still running the default kernel, modules and libc from
Slackware 8.1.

I have made a directory (/sys2), installed some base packages below it, and
when I chroot to /sys2, I can demonstrate the following:

(I read something about strace not following forks on Linux and I don't
understand it fully, but why does it do these lseek and fork operations in
one case and not in the other?)


In the "normal" system:


root@califas:~# strace /bin/true
execve("/bin/true", ["/bin/true"], [/* 25 vars */]) = 0
brk(0) = 0x8049c2c
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=20514, ...}) = 0
old_mmap(NULL, 20514, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40015000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0h\222\1"..., 1024) = 1024
fstat64(3, {st_mode=S_IFREG|0755, st_size=5029105, ...}) = 0
old_mmap(NULL, 1191168, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4001b000
mprotect(0x40134000, 40192, PROT_NONE) = 0
old_mmap(0x40134000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x119000) = 0x40134000
old_mmap(0x4013a000, 15616, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4013a000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4013e000
munmap(0x40015000, 20514) = 0
brk(0) = 0x8049c2c
brk(0x8049c46) = 0x8049c46
getpid() = 17900
open("/proc/17900///////////exe", O_RDONLY) = 3
lseek(3, 12, SEEK_SET) = 12
read(3, "p\"\0\0", 4) = 4
lseek(3, 0, SEEK_END) = 11693
lseek(3, 8816, SEEK_SET) = 8816
brk(0) = 0x8049c46
brk(0x804a769) = 0x804a769
read(3, "\351o\10\0\0\215v\0U\211\345\353\3X\353s\350\370\377\377"..., 2877) = 2877
close(3) = 0
getppid() = 17899
fork() = 17901
waitpid(17901,

and it hangs until I kill the strace process.


In the chroot system:

root@califas:/# strace /bin/true
execve("/bin/true", ["/bin/true"], [/* 22 vars */]) = 0
brk(0) = 0x8049c2c
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=14328, ...}) = 0
old_mmap(NULL, 14328, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40015000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0h\222\1"..., 1024) = 1024
fstat64(3, {st_mode=S_IFREG|0755, st_size=5029105, ...}) = 0
old_mmap(NULL, 1191168, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40019000
mprotect(0x40132000, 40192, PROT_NONE) = 0
old_mmap(0x40132000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x119000) = 0x40132000
old_mmap(0x40138000, 15616, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40138000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4013c000
munmap(0x40015000, 14328) = 0
brk(0) = 0x8049c2c
brk(0x8049c46) = 0x8049c46
getpid() = 17904
open("/proc/17904///////////exe", O_RDONLY) = -1 ENOENT (No such file or directory)
brk(0x8049c2c) = 0x8049c2c
brk(0) = 0x8049c2c
brk(0x8049c54) = 0x8049c54
brk(0x804a000) = 0x804a000
_exit(0) = ?


Here's a diff -y of those (chroot system on the left, "normal" system on the right):

execve("/bin/true", ["/bin/true"], [/* 22 vars */]) = 0       | execve("/bin/true", ["/bin/true"], [/* 25 vars */]) = 0
brk(0) = 0x8049c2c                                              brk(0) = 0x8049c2c
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such       open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such
open("/etc/ld.so.cache", O_RDONLY) = 3                          open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=14328, ...}) = 0    | fstat64(3, {st_mode=S_IFREG|0644, st_size=20514, ...}) = 0
old_mmap(NULL, 14328, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40015 | old_mmap(NULL, 20514, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40015
close(3) = 0                                                    close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3                            open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0h\222   read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0h\222
fstat64(3, {st_mode=S_IFREG|0755, st_size=5029105, ...}) = 0    fstat64(3, {st_mode=S_IFREG|0755, st_size=5029105, ...}) = 0
old_mmap(NULL, 1191168, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3,  | old_mmap(NULL, 1191168, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3,
mprotect(0x40132000, 40192, PROT_NONE) = 0                    | mprotect(0x40134000, 40192, PROT_NONE) = 0
old_mmap(0x40132000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE | old_mmap(0x40134000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE
old_mmap(0x40138000, 15616, PROT_READ|PROT_WRITE, MAP_PRIVATE | old_mmap(0x4013a000, 15616, PROT_READ|PROT_WRITE, MAP_PRIVATE
close(3) = 0                                                    close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_AN | old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_AN
munmap(0x40015000, 14328) = 0                                 | munmap(0x40015000, 20514) = 0
brk(0) = 0x8049c2c                                              brk(0) = 0x8049c2c
brk(0x8049c46) = 0x8049c46                                      brk(0x8049c46) = 0x8049c46
getpid() = 17904                                              | getpid() = 17900
open("/proc/17904///////////exe", O_RDONLY) = -1 ENOENT (No s | open("/proc/17900///////////exe", O_RDONLY) = 3
brk(0x8049c2c) = 0x8049c2c                                    | lseek(3, 12, SEEK_SET) = 12
brk(0) = 0x8049c2c                                            | read(3, "p\"\0\0", 4) = 4
brk(0x8049c54) = 0x8049c54                                    | lseek(3, 0, SEEK_END) = 11693
brk(0x804a000) = 0x804a000                                    | lseek(3, 8816, SEEK_SET) = 8816
_exit(0) = ?                                                  | brk(0) = 0x8049c46
                                                              > brk(0x804a769) = 0x804a769
                                                              > read(3, "\351o\10\0\0\215v\0U\211\345\353\3X\353s\350\370\377
                                                              > close(3) = 0
                                                              > getppid() = 17899
                                                              > fork() = 17901
                                                              > waitpid(17901,


Thanks for your help.

Keith.


2003-11-18 00:28:10

by Keith Whyte

Subject: Re: 2.4.18 fork & defunct child.

Edgar Toernig wrote:

{ strace listing deleted, see
http://marc.theaimsgroup.com/?l=linux-kernel&m=106905386725308&w=2 }

>That is not normal /bin/true behaviour. Sure your system
>isn't hacked? Give the -f option to strace to see what the
>forked process is trying to do... Compare the size of
>/bin/true with a known-good one.
>
>Ciao, ET.
>
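
Comparing sizes catches only some tampering, since a modified binary can be padded to the original length. A slightly stronger check is to compare content hashes. The sketch below assumes a known-good copy is available from trusted media; both paths in the usage comment are hypothetical. On a compromised machine the tools doing the check may not be trustworthy either, so this is best run from a clean environment.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large binaries need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(suspect: str, known_good: str) -> bool:
    """True when both files hash to the same SHA-256 digest."""
    return sha256_of(suspect) == sha256_of(known_good)


# Hypothetical usage, with the known-good copy on read-only media:
#   same_contents("/bin/true", "/mnt/cdrom/bin/true")
```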

I'm not sure. I should be running tripwire or something; this is the
only one of my systems that doesn't run such a thing, as I have the
firewall locked down and have been busy.
But it is true that I accidentally did iptables -F and it was left that way
for a few days.

But this happens with any program, not just /bin/true. Also, the
/bin/true binaries on the root and chroot systems are identical. And with
much interest I discovered that if I unmount /proc, the problem goes away. aggh.

That is why it does not show up in the chroot system: no /proc.
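
The two traces fit that reading: the extra lseek/read/fork sequence only appears after the open of /proc/<pid>/exe succeeds, and where that open returns ENOENT the process goes straight on to _exit(0). A minimal sketch of that pattern follows; it is only an illustration of code that depends on /proc, not a reconstruction of whatever is actually running on this machine, and the function name and proc_root parameter are made up.

```python
import os


def read_exe_magic(proc_root: str = "/proc", pid=None):
    """Try to read the first 4 bytes of a process's executable via <proc_root>/<pid>/exe."""
    pid = os.getpid() if pid is None else pid
    path = f"{proc_root}/{pid}/exe"
    try:
        with open(path, "rb") as f:
            return f.read(4)
    except OSError:
        # No /proc mounted (or no such entry): open() fails, typically
        # with ENOENT, and the caller has nothing further to work with.
        return None
```

Pointing proc_root at a directory that does not exist gives None, which mirrors the chroot case where /proc is absent.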

I also remember that when this first happened nearly a year ago, some
"unix engineer" at the ISP said, oh yeah, that's because something in the
ext2 filesystem header is corrupted. I don't remember exactly what he
said, only that it sounded so far-fetched that I ignored it.
Does that ring any bells with anyone?

Please help. Ugh, I hate having a Linux system that's not reliable. It feels
like having a pet that's in pain or something.

btw,
/lib/libc.so.6 -> libc-2.2.5.so

Keith

(I'm cross-posting here to gcc and admin in the hope of finding someone
who has seen this, thanks!)



2003-11-18 00:44:12

by Keith Whyte

Subject: Re: 2.4.18 fork & defunct child.


>Weird. Totally weird.
>
>Have you checked the systems for root kits? I'm really out of ideas
>here other than the usual hardware hosed / system compromised. One thing
>I can vouch for is Slackware 8.1 working ok as is, we've installed
>dozens of that particular release and all the machines are still
>humming away in the wild nicely.
>
>
>
Thanks Tomas,
weird it is, it has me stumped. I'm no spring chicken with Linux systems,
and I also have a Slackware 8.1 system that has been running fine on PCchips
hardware for years (well, since Slackware 8.1 came out; before that it had
7). But this is the only machine I've ever run a distro kernel on.

Unmounting /proc removes the problem.
What could be in /proc that would be causing it? Something
misrepresented about the memory? Or some other resource?


One thing I have noticed is that this happens on boot:
kernel: PCI_IDE: unknown IDE controller on PCI bus 00 device f9, VID=8086, DID=24cb
kernel: PCI: Device 00:1f.1 not available because of resource collisions

I sent some more info about the problem earlier to linux-kernel.

http://marc.theaimsgroup.com/?l=linux-kernel&m=106911546802893&w=2


thanks


2003-11-18 01:00:27

by Maciej Żenczykowski

Subject: Re: 2.4.18 fork & defunct child.

> { strace listing deleted, see
> http://marc.theaimsgroup.com/?l=linux-kernel&m=106905386725308&w=2 }

Well, I strace'd /bin/true on my glibc 2.3.2 system and it doesn't fork and
doesn't open /proc (the first place the two straces differ). Maybe your
libraries have been hacked, which seems the most likely to me: if this is
happening for all programs then the libc is likely bad...

I can't understand what it is opening /proc/.../exe for and I don't
understand what the ///////// in there is for (I think more than 2
consecutive slashes are illegal in POSIX, not sure though, never use more
than 2 :) )
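
For what it's worth, POSIX pathname resolution treats a run of consecutive slashes as a single slash; only a pathname that begins with exactly two slashes is implementation-defined. So the odd-looking path is legal and names the same file as /proc/<pid>/exe. A minimal sketch that checks this on a POSIX system (the helper names are made up for illustration):

```python
import os
import posixpath
import tempfile


def add_slashes(path: str, n: int = 10) -> str:
    """Insert n extra slashes before the final path component."""
    head, tail = posixpath.split(path)
    return head + "/" * (n + 1) + tail


def resolves_to_same_file(path: str) -> bool:
    """True when the padded path and the original name the same file."""
    return os.path.samefile(path, add_slashes(path))


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        print(resolves_to_same_file(f.name))
    # Lexical normalisation collapses the run of slashes as well.
    print(posixpath.normpath("/proc/17900" + "/" * 11 + "exe"))
```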

On a side note, /bin/true should take up something like 10 bytes of asm code -
what the hell is that thing doing more than exit(0) for? It shouldn't open
any files at all... what a bad design (and true --help and true --version
don't work anyway... duh!)

perhaps try ltrace'ing /bin/true and see what that prints out?

Cheers,
MaZe.



2003-11-18 10:39:18

by Frank van Maarseveen

Subject: Re: 2.4.18 fork & defunct child => system is hacked

On Mon, Nov 17, 2003 at 06:26:00PM -0600, Keith Whyte wrote:
>
> { strace listing deleted, see
> http://marc.theaimsgroup.com/?l=linux-kernel&m=106905386725308&w=2 }

First of all, /bin/true doing a fork() basically means you've
been hacked: there should not be any such code in there. The
open("/proc/17904///////////exe" is another piece of clear evidence
that your system has been hacked.

Why the additional slashes?

I suspect a library or LD_PRELOAD hack which simply encodes the getpid()
return value in decimal notation and stores it right into a static
buffer containing

"/proc//////////////////exe"

because it can't use sprintf at that point for some reason (maybe
just because it is a library/LD_PRELOAD hack).
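
The idea can be sketched as follows: overwrite the start of a slash-filled gap with the decimal digits of the PID and leave the rest of the slashes in place, so the buffer never changes length and no formatting routine is needed. The field width of 16 is an assumption chosen for illustration, and all names are hypothetical; this is not a reconstruction of the actual code.

```python
PREFIX = b"/proc/"
FIELD_WIDTH = 16  # assumed size of the slash-filled gap
TEMPLATE = PREFIX + b"/" * FIELD_WIDTH + b"exe"


def fill_in_pid(pid: int, template: bytes = TEMPLATE) -> bytes:
    """Write the decimal PID over the first slashes of the gap, in place."""
    buf = bytearray(template)
    digits = str(pid).encode("ascii")
    if len(digits) >= FIELD_WIDTH:
        # At least one slash must remain to separate the PID from "exe".
        raise ValueError("pid does not fit in the template")
    start = len(PREFIX)
    # Same-length slice assignment: the buffer size stays fixed.
    buf[start:start + len(digits)] = digits
    return bytes(buf)
```

With a five-digit PID this leaves eleven slashes before "exe", the same shape as the path in the strace output, and the leftover slashes are harmless because the kernel collapses them during path lookup.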


--
Frank

2003-11-19 19:47:25

by Keith Whyte

Subject: Re: 2.4.18 fork & defunct child => system is hacked

Frank van Maarseveen wrote:

>On Mon, Nov 17, 2003 at 06:26:00PM -0600, Keith Whyte wrote:
>
>
>>{ strace listing deleted, see
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=106905386725308&w=2 }
>>
>>
>
>First of all, /bin/true doing a fork() basically means you've
>been hacked: there should not be any such code in there. The
>open("/proc/17904///////////exe" is another piece of clear evidence
>that your system has been hacked.
>
>Why the additional slashes?
>

Is it at all possible that this behaviour is due to strace?
I have just installed, under a fresh directory and from the Slackware
packages, the glibc-so libs, a few progs and strace, and chroot'ed into
that system.
I still get the same behaviour. So does that mean it _has_ to be the
kernel that is at fault?

A cmp of the distro kernel against the one on my system does show this:

cmp -b -l /boot/vmlinuz /home/r2/boot/vmlinuz
499 1 ^A 0 ^@

But that is the rootflags, no? I must have set it read-only before.
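
A note on reading that output: cmp -l numbers bytes from 1 and prints the differing values in octal, so byte 499 is file offset 498 (0x1f2). As far as I know that is where the x86 boot header keeps its root_flags field, which would match the read-only explanation. A minimal sketch of the same comparison on in-memory data (the function name is made up, and the 512-byte buffers only stand in for the start of the two kernel images):

```python
def cmp_l(a: bytes, b: bytes):
    """List differing bytes like `cmp -l`: (1-based offset, byte in a, byte in b)."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    distro = bytes(512)            # stand-in for the first sector, all zeros
    installed = bytearray(distro)
    installed[498] = 1             # flip the byte at file offset 0x1f2
    for offset, x, y in cmp_l(bytes(installed), distro):
        print(f"{offset} {x:o} {y:o}")   # prints "499 1 0"
```

A single differing byte at that offset says nothing either way about the rest of the image; it only shows the two files are otherwise byte-identical over their common length.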


I am going to compile a kernel on a clean machine and boot the machine
with that as soon as I can get somebody down there to monitor it in case
it doesn't come back up with the new kernel.

>I suspect a library or LD_PRELOAD hack which simply encodes the getpid()
>return value in decimal notation and stores it right into a static
>buffer containing
>
> "/proc//////////////////exe"
>
>because it can't use sprintf at that point for some reason (maybe
>just because it is a library/LD_PRELOAD hack).
>
>
>
>
I think I vaguely know what you're saying here, but why? Why would it have
happened as soon as the machine was first brought up (after the
initial install), then again after a reinstall, and then go away? Why
then would it happen again some months later? And how would they have
hacked it? It only runs ssh and apache. No sendmail, no bind, none of
those usual culprits. apache is not running as root. The only other
listener is identd.
It also runs nfsd, but connections are firewalled from anything other
than a 192.168.0.1 address configured on the second NIC. Ah, but then I
did accidentally open the firewall recently for a few days.

hmmm.