2000-12-10 23:59:14

by Johnathan Corgan

[permalink] [raw]
Subject: Difficult to diagnose oops in 2.2.17, details at www.aeinet.com/oops

(Please cc: me directly if you so choose, I am not subscribed to l-k at the
moment. Thanks. Otherwise, I'll follow the thread in a l-k archive.)

All,

I'd like to enlist your help in diagnosing a series of oopsen over the last
eight weeks. I didn't start tracking the problem in detail until a week ago,
but you can now find the log files, ksymoops output, SysRq dumps, and other
files at:

http://www.aeinet.com/oops

Basically, the system is a stock Debian 2.2 (potato) install on an Intel
system, with a custom compiled kernel. The motherboard is a dual-cpu Intel
N440BX server board with single PIII (katmai) 500 cpu, 256 MB RAM, and is
very lightly loaded most of the time.

It had been stable for approximately nine months on 2.2.14, then began to
exhibit a variety of strange oops and crashes about eight weeks ago.
Generally, the oops would result in a hard system lock up, where only a reset
would return the box to normal. Some of these oops made it into
/var/log/messages, and ksymoops was run on that. The output is one of the
links at the URL above. The average uptime at this point was about an hour or
two, though sometimes it was only five minutes.

My first guess was hardware, specifically memory. Running memcheck86 on the
system found serious faults, so the memory was replaced with a new module
that passed memcheck86 with flying colors. The result was that the system
would stay up approximately one or two days at a time, but the oops and
lockups were still occurring.

Since it was taking about 15 minutes to reboot, I installed and converted to
reiserfs (3.5.57), and upgraded to 2.2.17 at the same time. The kernel is
now 2.2.17 + crypto (10) + reiserfs (3.5.57) + badram (3.3). No change in the
symptoms happened, and I was resetting the system manually every day or so.

Fast forward to about a week ago, when I decided that I should seek community
help. As an incentive, I'll ship one six-pack of beer anywhere in the world
to the first person who solves this problem permanently. Brand of beer is
negotiable--but I only have access to cheap American beer :)

See the web log for the (unsuccessful) efforts taken so far suggested by the
helpful people at #linpeople and #kernelnewbies on irc.debian.org.

I still suspect it's hardware, though I have no idea what it might be now.
Christmas break brings a new system to replace it, so it will be moot by
then, but I'm still determined to get to the bottom of it anyway.

Thanks for your time.

Johnathan Corgan
Atlas Enterprises Internet