LinuxLists.cc - Mysterious lockups with 2.4.X (Help w/KDB)

2001-02-25 18:39:58

Subject: Mysterious lockups with 2.4.X (Help w/KDB)

Hello,

I've been experiencing some strange lockups and I hope someone can
help. I don't remember exactly when they started, but I know they were
happening around the release of 2.4.0, possibly earlier with the
2.4.0-test kernels too. Around the time 2.4.0 came out, I started
patching in KDB to help find the problem, since it helped find an
emu10k issue a while ago.

First, my hardware:
I'm using a Tyan Tiger 133a motherboard (Via Apollo Pro 133a chipset)
with dual P3 600s, 512M RAM, Soundblaster Live, Matrox G[24]00 (was
using the G400, had a hardware problem, went back to a G200), Promise
Ultra66 PCI IDE card, ADS Tech IEEE 1394 PCI card, Adaptec 2940 PCI SCSI
card, and a Linksys 10/100 NIC. I'm also using a few USB devices: USB
mouse, USB compact flash reader, and USB Rio500. Because of the
continuing problems with PCI routing, SMP-enabled kernels, and the Via's
APIC, I need to disable APIC (with "noapic") when I boot if I wish to
use my USB devices. Finally, I have a serial console hooked to this
machine so that I can capture dump info and use KDB.

When I get a crash (it happened 3 times yesterday), a message like this
appears on the serial console:

Unable to handle kernel NULL pointer dereference at virtual address 0000017c
printing eip:
e08ac8bd
*pde = 00000000

Entering kdb (current=0xc7dbe000, pid 2513) on processor 0 Panic: Oops
due to panic @ 0xe08ac8bd
eax = 0xce39a400 ebx = 0x00000000 ecx = 0x0000000f edx = 0xc02daba0
esi = 0x00000282 edi = 0x00000000 esp = 0xc7dbff5c eip = 0xe08ac8bd
ebp = 0xc7dbff68 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010082
xds = 0xc02d0018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc7dbff28

and then the KDB prompt appears. Back when I had my emu10k problems,
Keith Owens told me (in a nutshell) to go through the running processes
and "btp" their PID and look for "text.lock". Back then, I'd eventually
find the process with the "text.lock", gather some more information, and
send it to this list where somebody would almost always be able to tell
me what the problem is.
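
As an aside on the register dump above, the eflags word can be decoded
by hand. The following is a minimal sketch in Python, not something from
the original thread; the bit positions are the architectural i386 ones,
and the helper name decode_eflags is made up for illustration.

```python
# Decode the i386 EFLAGS value taken from the kdb register dump.
# decode_eflags is a hypothetical helper, not part of kdb or ksymoops.
EFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF", 8: "TF",
    9: "IF", 10: "DF", 11: "OF", 14: "NT", 16: "RF", 17: "VM",
}

def decode_eflags(value):
    """Return the names of the flag bits set in an EFLAGS value, low bit first."""
    return [name for bit, name in sorted(EFLAGS_BITS.items())
            if value & (1 << bit)]

eflags = 0x00010082  # "eflags = 0x00010082" in the dump
flags = decode_eflags(eflags)
print("flags set:", flags)
print("interrupts enabled:", "IF" in flags)
```

For 0x00010082 this reports SF and RF but not IF, i.e. the fault was
taken with interrupts disabled on that CPU. Incidentally, the value in
esi (0x00000282) decodes as SF plus IF, so it may be a saved flags word,
though that is only a guess from the number.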

Needless to say, I'd like to be able to do this again, but I can't find
any process with "text.lock". When the crashes would first happen, I'd
go through the whole process table looking for the "text.lock" and not
find any. This could take quite a while, and I'd repeat it thinking I
was missing something somewhere. After doing this a few times, I
started to realize that it might not be me.

Another thing I noticed the last time I tried to use KDB to hunt down a
problem was that the process with "text.lock" would usually be the
process marked with a 1 in the "*" column of the process list. This
time, though, I'm still finding only one process marked with a 1 in the
"*" column (klogd), but there's no "text.lock".

Finally, if I run what's dumped to the console (like what I showed
above) through ksymoops, it just returns the same text with no further
expansion. (I guess this would make sense since "oops" never appears in
the text of my dump, so I guess it's not an oops, eh?)
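
Even without symbol expansion, the raw eip says something about where
the fault happened. The sketch below is an illustration in Python, not
output from any kernel tool. It assumes the default i386 layout of that
era (kernel mapped at 0xC0000000) and that all 512M of RAM is directly
mapped; both are assumptions, and the function name classify is made up.

```python
# Rough classification of a kernel virtual address on i386.
# Assumptions: default 3G/1G split, 512M of RAM all directly mapped.
PAGE_OFFSET = 0xC0000000        # assumed start of kernel address space
RAM_BYTES = 512 * 1024 * 1024   # 512M RAM, from the hardware list above

def classify(addr):
    """Say which broad region an address falls in under the assumed layout."""
    direct_map_end = PAGE_OFFSET + RAM_BYTES  # 0xE0000000 here
    if addr < PAGE_OFFSET:
        return "user space"
    if addr < direct_map_end:
        return "direct-mapped kernel memory"
    return "above the direct map (vmalloc area, where modules are loaded)"

for name, addr in [("eip", 0xE08AC8BD), ("edx", 0xC02DABA0),
                   ("current", 0xC7DBE000)]:
    print(f"{name} = {addr:#010x}: {classify(addr)}")
```

Under these assumptions the eip of 0xe08ac8bd lies above the directly
mapped RAM, which would point at code in a loadable module rather than
in the core kernel image. Resolving it to a name would then need the
module symbols as loaded at the time (for example from /proc/ksyms).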

Anyway, I'd like to try to track this bug down further -- it's making me
pull out what little hair I have =) -- but I don't know what to do next,
so I implore you, o great kernel hackers, to impart some of your
knowledge upon me so that I may debug too. Any tips would be good too.
=;]

FWIW, I'm running 2.4.2 with the KDB patch, modutils-2.4.2, binutils
2.10.0.33, util-linux 2.10s, e2fsprogs 1.19, GNU make 3.79, and RedHat's
kgcc (egcs 2.91.66).

thanks,
pete

2001-02-25 23:35:53

by Keith Owens

Subject: Re: Mysterious lockups with 2.4.X (Help w/KDB)

On Sun, 25 Feb 2001 13:39:16 -0500,
Pete Toscano <[email protected]> wrote:
>Unable to handle kernel NULL pointer dereference at virtual address 0000017c
> printing eip:
>and then the KDB prompt appears. Back when I had my emu10k problems,
>Keith Owens told me (in a nutshell) to go through the running processes
>and "btp" their PID and look for "text.lock".

Those instructions were for a lockup. This problem is not a lockup, it
is a bad pointer. In this case you only care about the current
process. Backtrace (bt) will show what the process was doing. 'id
%eip' will disassemble the failing code.
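
To illustrate the "bad pointer" reading: a fault at a small address such
as 0000017c is the usual signature of accessing something at offset
0x17c through a pointer that is NULL. The sketch below (Python, purely
illustrative, not a kdb feature) parses the register lines from the
report and lists the general registers holding zero, which are the
candidates for that NULL base. Which one is actually used can only be
confirmed from the disassembly at eip.

```python
import re

# Register lines copied from the kdb dump in the original report.
DUMP = """\
eax = 0xce39a400 ebx = 0x00000000 ecx = 0x0000000f edx = 0xc02daba0
esi = 0x00000282 edi = 0x00000000 esp = 0xc7dbff5c eip = 0xe08ac8bd
ebp = 0xc7dbff68 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010082
"""
FAULT_ADDR = 0x0000017C  # "virtual address 0000017c" in the oops text

def parse_registers(text):
    """Map register names to integer values from 'name = 0x...' pairs."""
    pairs = re.findall(r"(\w+) = (0x[0-9a-fA-F]+)", text)
    return {name: int(value, 16) for name, value in pairs}

regs = parse_registers(DUMP)
GENERAL = ("eax", "ebx", "ecx", "edx", "esi", "edi", "ebp")

# A register holding 0 plus a displacement of 0x17c gives the fault address.
null_candidates = [r for r in GENERAL if regs[r] == 0]
for r in null_candidates:
    print(f"{r} = 0, so {r} + {FAULT_ADDR:#x} would give the faulting address")
```

Here ebx and edi are both zero, so an instruction at eip using either of
them as a base with a displacement of 0x17c (380 bytes) would produce
this fault.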